Learning with Fewer Labeled Examples

Author

Computational Cognitive Sciences, University of Luxembourg

Published

May 1, 2026

Welcome

10:00 - 17:00, three slots with two breaks for lunch and coffee.

By 17:00 you will have:

trained a deep model on real EEG,
watched it fail in two different ways,
and learned seven strategies for keeping it from failing again.

A two-year-old learns to recognize a cat from about eight examples.

A standard deep network needs about eight million.

What this lecture is really about

The branch of deep learning that has been quietly trying, for the last decade, to close that gap.

The textbook name for this branch is learning with fewer labeled examples. It is the title and the topic of Chapter 19 of the PML book (Murphy 2022).

Today is essentially that chapter, in code, on brain data — reorganized around what resource each strategy exploits.

The problem

You have a model with a million parameters.

You have 30 labeled examples.

What do you do?

The PML book answers this in seven different ways.

The seven strategies, by resource

Resource	Strategy	PML §	Notebook
Your own data	Augmentation	19.1	single
Your own data	Better features (Riemannian)	19.2	single
Your own data	Semi-supervised (pseudo-labels)	19.3	single
Your own data	Weakly supervised (smoothing)	19.7	single
Other people’s data	Transfer learning	19.2	transfer
Other people’s data	Few-shot calibration	19.6	transfer
Smarter strategy	Active learning	19.4	active
Smarter strategy	Meta-learning (MAML)	19.5	active

How the day is shaped

Time	Notebook	Theme
10:00 - 12:00	`01_single.ipynb`	Working with what you have
12:00 - 13:00	Break	Lunch
13:00 - 15:00	`02_transfer.ipynb`	Help from other datasets and tasks
15:00 - 15:30	Break	Coffee
15:30 - 17:00	`03_active.ipynb`	Asking smarter questions

Tools we will use

Google Colab (or your local Jupyter) for the notebooks
MNE-Python + MOABB for EEG and the dataset
scikit-learn for the simple models
Braindecode for EEGNet
pyriemann for a strong classic baseline
learn2learn for the MAML demo
seaborn for plotting outcomes, not figures
sentence-transformers for the optional text coda

This is a course in computational cognitive sciences, not a deep learning course. We will not derive backpropagation. We will sometimes wave hands.

Working with What You Have

10:00 - 12:00

01_single.ipynb — PML §19.1, §19.2 (frozen view), §19.3, §19.7

The promise

We will train a tiny model on one subject and watch it fail when labels are scarce.

Then we will apply four fixes that all stay inside the same dataset:

Better features (Riemannian tangent space).
Augmentation.
Semi-supervised pseudo-labeling.
Label smoothing for noisy labels.

No transfer, no pretraining — that comes after lunch.

The dataset: PhysioNet Motor Imagery

109 subjects, 64-channel EEG, 160 Hz.
Each subject performs imagined movements of the left or right hand.
Standard MOABB benchmark.
Today: 10 subjects to keep the downloads polite.

Schalk et al. 2004.

What is motor imagery?

Imagine clenching your left fist. Without moving.

Now your right.

A small but reliable change happens in your sensorimotor cortex: the alpha and beta rhythms over the contralateral hemisphere get a little weaker. This is event-related desynchronization (ERD).

If we can measure it, we can in principle decode which hand you imagined.
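A minimal sketch of how ERD is commonly quantified: the percentage change in band power during imagery relative to a resting baseline, so negative values mean desynchronization. The function names, the 8–13 Hz band, and the synthetic 10 Hz signal are illustrative assumptions, not the notebook's code.

```python
import numpy as np

def band_power(signal, sfreq, fmin, fmax):
    """Mean periodogram power of a 1-D signal between fmin and fmax (Hz)."""
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / sfreq)
    psd = np.abs(np.fft.rfft(signal)) ** 2 / signal.size
    band = (freqs >= fmin) & (freqs <= fmax)
    return psd[band].mean()

def erd_percent(baseline, task, sfreq=160.0, fmin=8.0, fmax=13.0):
    """Relative band-power change in percent; negative means ERD."""
    ref = band_power(baseline, sfreq, fmin, fmax)
    act = band_power(task, sfreq, fmin, fmax)
    return 100.0 * (act - ref) / ref

# Synthetic example: a 10 Hz rhythm whose amplitude halves during imagery.
t = np.arange(320) / 160.0                 # 2 s at 160 Hz
baseline = np.sin(2 * np.pi * 10.0 * t)
task = 0.5 * baseline
erd = erd_percent(baseline, task)          # about -75: a quarter of the power remains
```

Halving the amplitude quarters the power, which is why the toy value lands near -75%. Real ERD is far smaller and noisier.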

Failure mode 1: not enough labels

Train and test on subject 1, but only give the classifier 5, 10, 20, 40, or 60 trials per class.

Accuracy rises monotonically with the number of labeled trials, and the curve is steepest at the low end.

This is the central failure of Ch. 19. Every strategy today is a different way to make the curve climb faster.

Why augmentation works

A trial of motor imagery is a sample from a much larger family of acceptable trials. Shift the time axis by 50 ms — still “left hand”. Drop a random electrode — still “left hand”.

Augmentation tells the model “these things should not matter”. It encodes invariances without writing them down.

Cognitive analog: a child concluding that a cat under different lighting is still a cat.

Augmentations we will try

Time shift: move the trial by a few samples.
Channel dropout: zero out a few random electrodes.
Gaussian noise: add a small amount.
Mixup: take a weighted average of two trials of the same class, so the label stays unchanged.

All four are one-liners in NumPy.
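As a rough sketch, the four augmentations could look like this. The trial shape (64 channels, 480 samples), the parameter defaults, and the function names are assumptions for illustration. The time shift here is circular, which is one simple choice among several.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical trials: 64 channels x 480 samples (3 s at 160 Hz).
trial = rng.standard_normal((64, 480))
other = rng.standard_normal((64, 480))   # a second trial of the same class

def time_shift(x, max_shift=8):
    """Circularly shift along time by up to max_shift samples (8 = 50 ms at 160 Hz)."""
    return np.roll(x, rng.integers(-max_shift, max_shift + 1), axis=-1)

def channel_dropout(x, p=0.1):
    """Zero each channel independently with probability p."""
    keep = rng.random(x.shape[0]) >= p
    return x * keep[:, None]

def gaussian_noise(x, scale=0.1):
    """Add zero-mean noise whose std is a fraction of the trial's std."""
    return x + rng.normal(0.0, scale * x.std(), size=x.shape)

def mixup_same_class(x1, x2, alpha=0.4):
    """Convex combination of two same-class trials; the label is unchanged."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2

augmented = [time_shift(trial), channel_dropout(trial),
             gaussian_noise(trial), mixup_same_class(trial, other)]
```

Each function returns a new array of the same shape, so augmented copies can be stacked onto the training set with their original labels.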

Riemannian tangent space

Each trial gives a covariance matrix across channels. Those matrices live on a curved manifold (the cone of symmetric positive-definite matrices). Project to a flat tangent space and a standard classifier can eat them.

Embarrassingly competitive on motor imagery for over a decade. Often beats deep models when labels are very few.
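To make the projection concrete, here is a NumPy-only sketch of the idea, not the pyriemann implementation. One loud assumption: it uses the arithmetic mean of the covariances as the reference point, which is simpler than a Riemannian mean. All names are illustrative.

```python
import numpy as np

def _sym_fn(mat, fn):
    """Apply a scalar function to a symmetric matrix through its eigenvalues."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * fn(vals)) @ vecs.T

def tangent_space_features(trials, reg=1e-6):
    """Map trials of shape (n_trials, n_channels, n_samples) to flat vectors."""
    n_trials, n_ch, n_samp = trials.shape
    centered = trials - trials.mean(axis=-1, keepdims=True)
    covs = centered @ centered.transpose(0, 2, 1) / (n_samp - 1)
    covs += reg * np.eye(n_ch)                     # keep every matrix positive definite
    ref = covs.mean(axis=0)                        # assumption: arithmetic-mean reference
    whitener = _sym_fn(ref, lambda v: v ** -0.5)   # ref^(-1/2)
    rows, cols = np.triu_indices(n_ch)
    # Off-diagonal entries get sqrt(2) so the vector norm matches the matrix norm.
    weights = np.where(rows == cols, 1.0, np.sqrt(2.0))
    feats = np.empty((n_trials, rows.size))
    for i, cov in enumerate(covs):
        log_cov = _sym_fn(whitener @ cov @ whitener, np.log)
        feats[i] = weights * log_cov[rows, cols]
    return feats

# Toy usage: 10 trials, 4 channels, 200 samples -> 10 vectors of length 4*5/2 = 10.
rng = np.random.default_rng(0)
features = tangent_space_features(rng.standard_normal((10, 4, 200)))
```

The output is an ordinary feature matrix, so any standard classifier (logistic regression, LDA, SVM) can be fit on it directly.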

Pseudo-labeling, in three lines

Train on the small labeled set.
Predict on the unlabeled pool. Keep the confident ones.
Add to training. Retrain.

Cognitive analog: every grad student doing a first literature review. Read a few papers carefully, skim many more, trust your judgment on the easy ones.

If your judgment is good, this works. If not, it amplifies your initial bias.
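The three steps can be sketched with scikit-learn as follows. The confidence threshold of 0.9, the logistic-regression classifier, and the synthetic two-class features are assumptions for illustration. The notebook may use a different model and threshold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pseudo_label_round(X_lab, y_lab, X_pool, threshold=0.9):
    """One round: fit on labels, keep confident pool predictions, refit on the union."""
    clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    proba = clf.predict_proba(X_pool)
    confident = proba.max(axis=1) >= threshold
    pseudo_y = clf.classes_[proba.argmax(axis=1)]
    X_aug = np.vstack([X_lab, X_pool[confident]])
    y_aug = np.concatenate([y_lab, pseudo_y[confident]])
    clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
    return clf, confident

# Synthetic stand-in for trial features: two Gaussian classes, 5 labels per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (200, 5)), rng.normal(1, 1, (200, 5))])
y = np.repeat([0, 1], 200)
lab_idx = np.concatenate([np.arange(5), np.arange(200, 205)])
pool_idx = np.setdiff1d(np.arange(400), lab_idx)

clf, confident = pseudo_label_round(X[lab_idx], y[lab_idx], X[pool_idx])
```

The threshold is the knob that trades coverage against the risk described above: lower it and more pool trials join the training set, including more wrong ones.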

Label smoothing for noisy labels

Real labels are messy. In motor imagery we cannot verify what the subject actually imagined. Clinical EEG interrater agreement caps near 80%.

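Standard cross-entropy: “class 1, 100% certain”. Label smoothing: “class 1, 90% certain, class 2 10%”. With hard targets the loss only reaches zero as the logits go to infinity, so every label, including a wrong one, keeps pushing for more confidence. With smoothed targets the optimum sits at a finite logit.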

We corrupt 10–30% of training labels and watch the smoothed model degrade more gracefully than the standard one.
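A small numerical sketch of why smoothing caps confidence. It assumes the convention from the slide, where the labeled class gets 1 - eps and the remainder is split among the other classes. Some libraries instead spread eps over all classes, so check the convention before comparing numbers. Function names are illustrative.

```python
import numpy as np

def smooth_targets(y, n_classes, eps=0.1):
    """Give the labeled class 1 - eps and split eps among the other classes."""
    targets = np.full((len(y), n_classes), eps / (n_classes - 1))
    targets[np.arange(len(y)), y] = 1.0 - eps
    return targets

def cross_entropy(logits, targets):
    """Per-example cross-entropy between softmax(logits) and soft targets."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -(targets * log_p).sum(axis=1)

# Sweep the logit margin in favor of the labeled class (class 0 of 2).
margins = np.linspace(0.0, 10.0, 1001)
logits = np.column_stack([margins, np.zeros_like(margins)])
y = np.zeros(len(margins), dtype=int)

hard_loss = cross_entropy(logits, np.eye(2)[y])            # keeps falling as margin grows
soft_loss = cross_entropy(logits, smooth_targets(y, 2))    # has a finite minimum

best_margin = margins[soft_loss.argmin()]   # near log(0.9 / 0.1)
```

The hard-label loss rewards ever larger margins, while the smoothed loss bottoms out near a margin of log 9, about 2.2. A mislabeled trial therefore cannot drag the model toward unbounded confidence.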

Open the notebook now

01_single.ipynb

While the data downloads (about 5 minutes the first time), keep reading.

Discussion: within-dataset fixes

Riemannian features from 2012 are still close to state of the art. What does that say about “deep” vs “geometric” representations for EEG?
Mixup blends two trials and their labels. Does that make biological sense for EEG?
Pseudo-labeling worked because the unlabeled trials matched the labeled ones. What if the pool came from a different subject?

(Poll on screen.)

Single-dataset takeaways

The within-subject few-labels curve is the diagnosis. Everything else is a fix.
Better features (Riemannian) and pseudo-labels are the largest wins on this dataset.
Augmentation buys a few points; label smoothing only earns its keep once labels are actually noisy.
None of these techniques requires borrowing from anyone else.

Bridge to the afternoon. After lunch we leave the single-dataset world. The classifier still fails — but now for a different reason: the wrong person.

Help from Other Datasets and Tasks

13:00 - 15:00

02_transfer.ipynb — PML §19.2 (transfer + frozen deep view), §19.6

Failure mode 2: not the right person

Same paradigm, different brain. Train on subject 1, test on subject 2 with zero calibration.

Sweep all 10 subjects in both directions. The cross-subject bars sit near 0.55 accuracy, barely above the 0.50 chance level for two classes, while within-subject bars sit at 0.65–0.85.

That gap is the second failure of Ch. 19 — and the entire afternoon is about closing it.

The trick: a pretrained model

We train a small EEGNet on 9 subjects. The model now knows what good EEG features look like — they are not specific to any one person, they are properties of the human sensorimotor system.

The frozen model is the prior. Whatever we put on top — a logistic regression, a fine-tuned final layer — is the posterior update. (You will recognize this from Bayesian cognitive science.)

Three feature extractors, one classifier

Features	Era	Cost
Bandpower	1990s	Trivial
Riemannian tangent space	2010s	Cheap
Pretrained EEGNet penultimate	2020s	One forward pass

The same logistic regression (LR) sits on top of each. The pretrained penultimate layer scores above both hand-engineered baselines. Borrowing pays even when we never let the network touch the target.

Transfer regimes

Regime	Pretrain	Fine-tune
From scratch	none	target only
Zero-shot transfer	9 source subjects	none
Fine-tuned transfer	9 source subjects	20 target trials
Fine-tuned transfer + augmentation	9 source subjects	20 target trials + augmented copies

Each fix adds a few points. They stack.

Few-shot calibration

The chronic problem of BCI: every new user needs a 20-minute calibration session. What if we could shrink it to one minute?

Sweep N, the number of target-subject calibration trials, over {1, 2, 5, 10, 20, 50}. For each N, fine-tune the pretrained EEGNet on those trials, score on the rest, and plot accuracy against N.

The shape tells you what “enough calibration” means for your application: a research demo lives at the steep part; a wheelchair controller does not.

Cog-sci framing: priors

Lake et al. (2015) made the same argument with humans on Omniglot: people learn from one or a few examples because they bring strong priors.

Pretraining on 9 other subjects is the prior. The parameters are borrowed instead of philosophized.

Optional coda: the recipe outside EEG

A 10-minute detour to show that “frozen encoder + tiny classifier” is not EEG-specific.

We embed ~40 DSM-5-style sentences with nomic-embed-text-v1.5 (2024+, open-weight, T4-friendly), project to 2D with UMAP, and watch four clear clusters emerge with no supervision.

Same recipe, different modality.

Discussion: borrowing

Why is “pretrained on 9 brains” only modestly above chance for motor imagery, when “pretrained on ImageNet” is often enough for natural-image tasks?
If you could keep only one — augmentation or transfer — which, and why?
Real label noise is often systematic. Does any of today’s borrowing help with that?

Borrowing takeaways

A frozen pretrained encoder + a simple classifier is the strongest first move in 2026.
Geometry-aware features (Riemannian) are still hard to beat on small EEG data.
Fine-tuning + augmentation usually stack.
Few-shot calibration: pretraining transforms calibration from “20 minutes” to “1 minute” without much accuracy cost.
The recipe transfers across modalities. Same idea with nomic-embed works on text.

Bridge to the last slot. Five strategies covered. After the coffee, the last two — but they target a different question: not what to borrow, but what to ask.

Asking Smarter Questions

15:30 - 17:00

03_active.ipynb — PML §19.4, §19.5

A different kind of failure

So far we asked “how do I make more from what I have?” or “can I borrow from elsewhere?”

Now: “given a fixed labeling budget, can I choose which labels to acquire?”

And: “given many similar tasks, can I learn an algorithm that adapts quickly to each new one?”

Active learning (PML §19.4)

You have an unlabeled pool. Each label costs something (in BCI: a calibration trial). Which examples should you label first?

The classical answer: uncertainty sampling. Train on a small seed, then iteratively pick the unlabeled example the model is least sure about, label it, retrain.

Examples the model is already confident about teach it nothing. Examples on the decision boundary do.
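A minimal sketch of the uncertainty-sampling loop, assuming a logistic-regression model and "least confident" as the uncertainty measure (the lowest top-class probability). The oracle is simulated by looking up the true label. The data and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(X, y_oracle, seed_idx, n_queries):
    """Grow the labeled set one query at a time, always asking about the
    pool example whose top predicted probability is lowest."""
    labeled = list(seed_idx)            # the seed must contain both classes
    seed = set(seed_idx)
    pool = [i for i in range(len(X)) if i not in seed]
    clf = LogisticRegression(max_iter=1000)
    for _ in range(n_queries):
        clf.fit(X[labeled], y_oracle[labeled])
        proba = clf.predict_proba(X[pool])
        pick = pool[int(np.argmin(proba.max(axis=1)))]
        labeled.append(pick)            # "ask the oracle" for this label
        pool.remove(pick)
    clf.fit(X[labeled], y_oracle[labeled])
    return clf, labeled

# Synthetic two-class data; seed with two labeled examples per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.repeat([0, 1], 100)
clf, labeled = uncertainty_sampling(X, y, seed_idx=[0, 1, 100, 101], n_queries=10)
```

Replacing the `argmin` line with a random pick gives the passive baseline to compare against under the same labeling budget.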

The cog-sci frame

This is exactly what an adaptive psychophysics protocol does.

QUEST (Watson & Pelli, 1983), staircase methods, computerized adaptive testing — all are special cases of “choose the next observation to maximize expected information gain”. PML §19.4 is the formal version.

A skilled clinical interviewer does the same without naming it: the questions that buy the most information are the ones whose answers you cannot predict.

Meta-learning, in one slide (PML §19.5)

What if the algorithm itself could be learned from a distribution of tasks?

MAML (Finn et al., 2017) learns an initialization that adapts to new tasks in a few gradient steps. Each “task” is a subject; episodes split into support (adapt) and query (evaluate).

On motor imagery, MAML usually does not beat well-tuned pretraining by much. The mechanism is what matters; the absolute number is not.
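To show the mechanism without a deep-learning framework, here is a toy MAML sketch in NumPy. The assumptions are loud: each "task" is a noisy line standing in for a subject, the model is linear, and because the loss is quadratic the second-order term can be written exactly. None of this is the learn2learn demo from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task(n=10):
    """Hypothetical task: a noisy line with task-specific slope and intercept."""
    slope, intercept = rng.normal(2.0, 0.3), rng.normal(-1.0, 0.3)
    x = rng.uniform(-1, 1, size=2 * n)
    X = np.column_stack([x, np.ones_like(x)])
    y = slope * x + intercept + 0.05 * rng.standard_normal(2 * n)
    return (X[:n], y[:n]), (X[n:], y[n:])        # support set, query set

def mse_grad(theta, X, y):
    """Gradient of the mean squared error of a linear model."""
    return 2.0 * X.T @ (X @ theta - y) / len(y)

def maml_step(theta, tasks, inner_lr=0.1, outer_lr=0.05):
    """One meta-update: adapt on support, differentiate the query loss
    through the adaptation step, average over tasks."""
    meta_grad = np.zeros_like(theta)
    for (Xs, ys), (Xq, yq) in tasks:
        adapted = theta - inner_lr * mse_grad(theta, Xs, ys)   # inner loop
        hessian = 2.0 * Xs.T @ Xs / len(ys)                    # constant for linear MSE
        jac = np.eye(len(theta)) - inner_lr * hessian          # d(adapted) / d(theta)
        meta_grad += jac.T @ mse_grad(adapted, Xq, yq)
    return theta - outer_lr * meta_grad / len(tasks)

theta = np.zeros(2)                               # the learned initialization
for _ in range(500):
    theta = maml_step(theta, [sample_task() for _ in range(8)])
```

The learned `theta` ends up near the center of the task distribution, so a single inner gradient step on a new task's support set is enough to get close to that task's own solution.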

Cog-sci framing: learning sets

“Learning to learn” comes from Harlow’s 1949 paper on rhesus monkeys.

Monkeys solving many similar discrimination problems got faster at solving new ones, eventually after a single trial. He called it a “learning set”.

MAML is the modern, gradient-friendly formalization. The hierarchical-Bayes framing makes the inheritance explicit: the thing learned across tasks is a prior over weights, which functions as the learning set.

Discussion: smarter strategies

A colleague hands you a PhD dataset of 30 EEG trials. Which of the day’s strategies do you reach for first? Why?
Several methods (pseudo-labeling, MAML, augmentation) amplify whatever assumptions the model already had. When is that good, when dangerous?
Pick one cognitive analog we invoked today and argue that it is misleading rather than helpful.

Smarter-strategy takeaways

Active learning is adaptive experimental design with a loss function.
Choosing what to label is often more useful than labeling more.
Meta-learning is the formal version of “learning to learn”.
Every section of PML Ch. 19 now has a result on the same EEG dataset.

Wrap

16:50 - 17:00

The day in one table

PML §	Strategy	Notebook	Cog-sci analog
19.1	Augmentation	single	Invariances learned through play
19.2	Transfer learning	transfer	Adult learning a new language
19.3	Semi-supervised (pseudo-labels)	single	Skimming after careful reading
19.4	Active learning	active	Adaptive psychophysics (QUEST)
19.5	Meta-learning (MAML)	active	Harlow’s learning sets (1949)
19.6	Few-shot calibration	transfer	Lake et al. one-shot category learning
19.7	Weakly supervised (smoothing)	single	Trusting noisy authorities, Bayes-style

Each one is a different way to make a model behave more like a learner that already had a brain before the data arrived. They are re-inventions, in a different vocabulary, of priors.

Thank you

Slides and notebooks: github.com/ccs-unilu/few2026

Questions, complaints, and existential crises: morteza.ansarinia@uni.lu
