Learning with Fewer Labeled Examples
Welcome
10:00 - 17:00: three slots, separated by a lunch break and a coffee break.
By 17:00 you will have:
- trained a deep model on real EEG,
- watched it fail in two different ways,
- and learned seven strategies for keeping it from failing again.
A two-year-old learns to recognize a cat from about eight examples.
A standard deep network needs about eight million.
What this lecture is really about
The branch of deep learning that has been quietly trying, for the last decade, to close that gap.
The textbook name for this branch is learning with fewer labeled examples. It is the title and the topic of Chapter 19 of the PML book (Murphy 2022).
Today is essentially that chapter, in code, on brain data — reorganized around what resource each strategy exploits.
The problem
You have a model with a million parameters.
You have 30 labeled examples.
What do you do?
The PML book answers this in seven different ways.
The seven strategies, by resource
| Resource | Strategy | PML § | Notebook |
|---|---|---|---|
| Your own data | Augmentation | 19.1 | single |
| Your own data | Better features (Riemannian) | 19.2 | single |
| Your own data | Semi-supervised (pseudo-labels) | 19.3 | single |
| Your own data | Weakly supervised (smoothing) | 19.7 | single |
| Other people’s data | Transfer learning | 19.2 | transfer |
| Other people’s data | Few-shot calibration | 19.6 | transfer |
| Smarter strategy | Active learning | 19.4 | active |
| Smarter strategy | Meta-learning (MAML) | 19.5 | active |
How the day is shaped
| Time | Notebook | Theme |
|---|---|---|
| 10:00 - 12:00 | 01_single.ipynb | Working with what you have |
| 12:00 - 13:00 | Break | |
| 13:00 - 15:00 | 02_transfer.ipynb | Help from other datasets and tasks |
| 15:00 - 15:30 | Break | |
| 15:30 - 17:00 | 03_active.ipynb | Asking smarter questions |
Tools we will use
- Google Colab (or your local Jupyter) for the notebooks
- MNE-Python + MOABB for EEG and the dataset
- scikit-learn for the simple models
- Braindecode for EEGNet
- pyriemann for a strong classic baseline
- learn2learn for the MAML demo
- seaborn for plotting outcomes, not figures
- sentence-transformers for the optional text coda
This is a course in computational cognitive sciences, not a deep learning course. We will not derive backpropagation. We will sometimes wave hands.
Working with What You Have
10:00 - 12:00
01_single.ipynb — PML §19.1, §19.2 (frozen view), §19.3, §19.7
The promise
We will train a tiny model on one subject and watch it fail when labels are scarce.
Then we will apply four fixes that all stay inside the same dataset:
- Better features (Riemannian tangent space).
- Augmentation.
- Semi-supervised pseudo-labeling.
- Label smoothing for noisy labels.
No transfer, no pretraining — that comes after lunch.
The dataset: PhysioNet Motor Imagery
- 109 subjects, 64-channel EEG, 160 Hz.
- Each subject performs imagined movements of the left or right hand.
- Standard MOABB benchmark.
- Today: 10 subjects to keep the downloads polite.
Schalk et al. 2004.
What is motor imagery?
Imagine clenching your left fist. Without moving.
Now your right.
A small but reliable change happens in your sensorimotor cortex: the alpha and beta rhythms over the contralateral hemisphere get a little weaker. This is event-related desynchronization (ERD).
If we can measure it, we can in principle decode which hand you imagined.
Failure mode 1: not enough labels
Train and test on subject 1, but only give the classifier 5, 10, 20, 40, or 60 trials per class.
The learning curve is monotonic and steep at the bottom.
This is the central failure of Ch. 19. Every strategy today is a different way to make the curve climb faster.
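A minimal sketch of the label-budget sweep behind that curve, assuming the labels sit in a 1-D array and the test trials have already been set aside. The function name `subsample_per_class` and the synthetic label vector are illustrative, not taken from the notebook.

```python
import numpy as np

def subsample_per_class(y, n_per_class, rng):
    """Return sorted indices of n_per_class randomly chosen trials per class."""
    y = np.asarray(y)
    picked = []
    for label in np.unique(y):
        candidates = np.flatnonzero(y == label)
        picked.append(rng.choice(candidates, size=n_per_class, replace=False))
    return np.sort(np.concatenate(picked))

rng = np.random.default_rng(0)
y = np.array([0] * 60 + [1] * 60)    # hypothetical training pool: 60 trials per class
budgets = [5, 10, 20, 40, 60]        # trials per class, as on the slide

# One training subset per budget; fit a classifier on each and score it on
# the same held-out test trials to trace the learning curve.
subsets = {n: subsample_per_class(y, n, rng) for n in budgets}
```

Repeating the sweep over several random seeds and averaging gives a smoother curve than a single draw.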
Why augmentation works
A trial of motor imagery is a sample from a much larger family of acceptable trials. Shift the time axis by 50 ms — still “left hand”. Drop a random electrode — still “left hand”.
Augmentation tells the model “these things should not matter”. It encodes invariances without writing them down.
Cognitive analog: a child concluding a cat under different lighting is still a cat.
Augmentations we will try
- Time shift: move the trial by a few samples.
- Channel dropout: zero out a few random electrodes.
- Gaussian noise: add a small amount.
- Mixup: average two trials of the same class.
All four are one-liners in numpy.
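The four augmentations above can be sketched in numpy as follows, assuming one trial is an array of shape (channels, samples). The function names, the circular shift, the noise scale, and the Beta-distributed mixing weight are all illustrative choices, not the notebook's. Mixup is written here as on the slide, between two trials of the same class, so the label is unchanged; the more general form also blends the labels.

```python
import numpy as np

def time_shift(x, max_shift, rng):
    """Circularly shift the trial along time by up to max_shift samples."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(x, shift, axis=-1)   # assumption: wrap-around instead of re-cropping

def channel_dropout(x, n_drop, rng):
    """Zero out n_drop randomly chosen channels."""
    out = x.copy()
    dropped = rng.choice(x.shape[0], size=n_drop, replace=False)
    out[dropped] = 0.0
    return out

def gaussian_noise(x, scale, rng):
    """Add zero-mean noise whose std is scale times the trial's own std."""
    return x + rng.normal(0.0, scale * x.std(), size=x.shape)

def mixup_same_class(x_a, x_b, rng, alpha=0.4):
    """Convex combination of two trials of the same class; label unchanged."""
    lam = rng.beta(alpha, alpha)
    return lam * x_a + (1.0 - lam) * x_b

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 480))    # synthetic trial: 64 channels, 3 s at 160 Hz
x2 = rng.standard_normal((64, 480))   # a second synthetic trial of the same class

shifted = time_shift(x, max_shift=8, rng=rng)   # 8 samples = 50 ms at 160 Hz
dropped = channel_dropout(x, n_drop=4, rng=rng)
noisy = gaussian_noise(x, scale=0.1, rng=rng)
mixed = mixup_same_class(x, x2, rng)
```

Each function returns a new array and leaves the original trial untouched, so they can be chained or applied on the fly during training.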
Riemannian tangent space
Each trial gives a covariance matrix across channels. Those matrices live on a curved manifold (the cone of symmetric positive-definite matrices). Project to a flat tangent space and a standard classifier can eat them.
Embarrassingly competitive on motor imagery for over a decade. Often beats deep models when labels are very few.
Pseudo-labeling, in three lines
- Train on the small labeled set.
- Predict on the unlabeled pool. Keep the confident ones.
- Add to training. Retrain.
Cognitive analog: every grad student doing a first literature review. Read a few papers carefully, skim many more, trust your judgment on the easy ones.
If your judgment is good, this works. If not, it amplifies your initial bias.
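The three steps above can be sketched as one round of self-training. To stay dependency-free, this uses a nearest-centroid classifier as a stand-in for whatever model the notebook uses. The 0.9 confidence threshold and the synthetic two-class data are assumptions for illustration.

```python
import numpy as np

def fit_centroids(X, y):
    """Nearest-centroid 'training': one mean vector per class."""
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def predict_proba(X, centroids):
    """Softmax over negative squared distances to each centroid."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def pseudo_label(X_lab, y_lab, X_pool, threshold=0.9):
    # 1. Train on the small labeled set.
    classes, centroids = fit_centroids(X_lab, y_lab)
    # 2. Predict on the unlabeled pool and keep the confident ones.
    proba = predict_proba(X_pool, centroids)
    confident = proba.max(axis=1) >= threshold
    y_pseudo = classes[proba.argmax(axis=1)]
    # 3. Add them to the training set and retrain.
    X_new = np.concatenate([X_lab, X_pool[confident]])
    y_new = np.concatenate([y_lab, y_pseudo[confident]])
    return fit_centroids(X_new, y_new), confident

rng = np.random.default_rng(0)
means = np.array([[-1.0, 0.0], [1.0, 0.0]])          # two synthetic classes
y_lab = np.array([0] * 5 + [1] * 5)
X_lab = means[y_lab] + 0.5 * rng.standard_normal((10, 2))
y_pool = rng.integers(0, 2, size=200)                # hidden from the learner
X_pool = means[y_pool] + 0.5 * rng.standard_normal((200, 2))
(classes, centroids), confident = pseudo_label(X_lab, y_lab, X_pool)
```

Raising the threshold admits fewer but cleaner pseudo-labels; lowering it adds more data along with more of the model's own mistakes.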
Label smoothing for noisy labels
Real labels are messy. In motor imagery we cannot verify what the subject actually imagined. Clinical EEG interrater agreement caps near 80%.
Standard cross-entropy: “class 1, 100% certain”. Label smoothing: “class 1, 90% certain, class 2 10%”. With a target below 100%, the model is no longer rewarded for pushing its logits ever higher to fit any single label.
We corrupt 10–30% of training labels and watch the smoothed model degrade more gracefully than the standard one.
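A small numerical sketch of the idea, assuming the common convention of mixing the one-hot target with a uniform distribution. Under that convention, a smoothing factor of 0.2 with two classes gives the 90% / 10% targets quoted above. The function names are illustrative.

```python
import numpy as np

def smooth_targets(y, n_classes, eps):
    """One-hot targets mixed with a uniform distribution over classes."""
    one_hot = np.eye(n_classes)[y]
    return (1.0 - eps) * one_hot + eps / n_classes

def cross_entropy(logits, targets):
    """Mean cross-entropy between softmax(logits) and (soft) targets."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=1).mean())

y = np.array([0, 1])
hard = smooth_targets(y, n_classes=2, eps=0.0)
soft = smooth_targets(y, n_classes=2, eps=0.2)   # rows are [0.9, 0.1] and [0.1, 0.9]

# An increasingly confident model: with hard targets the loss keeps falling
# as the logits grow, with smoothed targets it starts rising again.
for scale in (2.0, 5.0, 10.0):
    logits = scale * np.array([[1.0, -1.0], [-1.0, 1.0]])
    print(scale, cross_entropy(logits, hard), cross_entropy(logits, soft))
```

With smoothed targets the loss has a finite minimum, so a mislabeled trial can only pull the model so far.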
Open the notebook now
01_single.ipynb
While the data downloads (about 5 minutes the first time), keep reading.
Discussion: within-dataset fixes
- Riemannian features from 2012 are still close to state of the art. What does that say about “deep” vs “geometric” representations for EEG?
- Mixup blends two trials and their labels. Does that make biological sense for EEG?
- Pseudo-labeling worked because the unlabeled trials matched the labeled ones. What if the pool came from a different subject?
(Poll on screen.)
Single-dataset takeaways
- The within-subject few-labels curve is the diagnosis. Everything else is a fix.
- Better features (Riemannian) and pseudo-labels are the largest wins on this dataset.
- Augmentation buys a few points; label smoothing only earns its keep once labels are actually noisy.
- None of these techniques require borrowing from anyone else.
Bridge to the afternoon. After lunch we leave the single-dataset world. The classifier still fails — but now for a different reason: the wrong person.
Help from Other Datasets and Tasks
13:00 - 15:00
02_transfer.ipynb — PML §19.2 (transfer + frozen deep view), §19.6
Failure mode 2: not the right person
Same paradigm, different brain. Train on subject 1, test on subject 2 with zero calibration.
Sweep all 10 subjects in both directions. The cross-subject bars sit near 0.55 while within-subject bars sit at 0.65–0.85.
That gap is the second failure of Ch. 19 — and the entire afternoon is about closing it.
The trick: a pretrained model
We train a small EEGNet on 9 subjects. The model now knows what good EEG features look like — they are not specific to any one person, they are properties of the human sensorimotor system.
The frozen model is the prior. Whatever we put on top — a logistic regression, a fine-tuned final layer — is the posterior update. (You will recognize this from Bayesian cognitive science.)
Three feature extractors, one classifier
| Features | Era | Cost |
|---|---|---|
| Bandpower | 1990s | Trivial |
| Riemannian tangent space | 2010s | Cheap |
| Pretrained EEGNet penultimate | 2020s | One forward pass |
The same logistic regression (LR) sits on top of each. The pretrained penultimate layer scores above both hand-engineered baselines. Borrowing pays even when we never let the network touch the target.
Transfer regimes
| Regime | Pretrain | Fine-tune |
|---|---|---|
| From scratch | none | target only |
| Zero-shot transfer | 9 source subjects | none |
| Fine-tuned transfer | 9 source subjects | 20 target trials |
| Fine-tuned transfer + augmentation | 9 source subjects | 20 target + augmented copies |
Each fix adds a few points. They stack.
Few-shot calibration
The chronic problem of BCI: every new user needs a 20-minute calibration session. What if we could shrink it to one minute?
Sweep the number of calibration trials N over {1, 2, 5, 10, 20, 50}. Fine-tune the pretrained EEGNet on each subset, score on the remaining trials, and plot the curve.
The shape tells you what “enough calibration” means for your application: a research demo lives at the steep part; a wheelchair controller does not.
Cog-sci framing: priors
Lake et al. (2015) made the same argument with humans on Omniglot: people learn from one or a few examples because they bring strong priors.
Pretraining on 9 other subjects is the prior. The parameters are borrowed instead of philosophized.
Optional coda: the recipe outside EEG
A 10-minute detour to show that “frozen encoder + tiny classifier” is not EEG-specific.
We embed ~40 DSM-5-style sentences with nomic-embed-text-v1.5 (2024+, open-weight, T4-friendly), project to 2D with UMAP, and watch four clear clusters emerge with no supervision.
Same recipe, different modality.
Discussion: borrowing
- Why is “pretrained on 9 brains” only modestly above chance for motor imagery, when “pretrained on ImageNet” is often enough for natural-image tasks?
- If you could keep only one — augmentation or transfer — which, and why?
- Real label noise is often systematic. Does any of today’s borrowing help with that?
Borrowing takeaways
- A frozen pretrained encoder + a simple classifier is the strongest first move in 2026.
- Geometry-aware features (Riemannian) are still hard to beat on small EEG data.
- Fine-tuning + augmentation usually stack.
- Few-shot calibration: pretraining transforms calibration from “20 minutes” to “1 minute” without much accuracy cost.
- The recipe transfers across modalities. The same idea with nomic-embed works on text.
Bridge to the last slot. Five strategies covered. After the coffee, the last two — but they target a different question: not what to borrow, but what to ask.
Asking Smarter Questions
15:30 - 17:00
03_active.ipynb — PML §19.4, §19.5
A different kind of failure
So far we asked “how do I make more from what I have?” or “can I borrow from elsewhere?”
Now: “given a fixed labeling budget, can I choose which labels to acquire?”
And: “given many similar tasks, can I learn an algorithm that adapts quickly to each new one?”
Active learning (PML §19.4)
You have an unlabeled pool. Each label costs something (in BCI: a calibration trial). Which examples should you label first?
The classical answer: uncertainty sampling. Train on a small seed, then iteratively pick the unlabeled example the model is least sure about, label it, retrain.
Examples the model is already confident about teach it nothing. Examples on the decision boundary do.
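A minimal sketch of that loop, using the least-confident criterion (lowest top-class probability) as the uncertainty measure. The nearest-centroid model, the synthetic pool, and the `oracle` callback that reveals a label are stand-ins chosen to keep the example self-contained; they are not the notebook's setup.

```python
import numpy as np

def least_confident(proba):
    """Index of the example whose top predicted probability is lowest."""
    return int(np.argmin(proba.max(axis=1)))

def uncertainty_sampling(X_pool, oracle, fit, predict_proba, seed_idx, n_queries):
    """Seed, then repeatedly label the least-confident pool example and retrain."""
    labeled = list(seed_idx)
    labels = [oracle(i) for i in labeled]
    for _ in range(n_queries):
        model = fit(X_pool[labeled], np.array(labels))
        remaining = np.setdiff1d(np.arange(len(X_pool)), labeled)
        pick = int(remaining[least_confident(predict_proba(model, X_pool[remaining]))])
        labeled.append(pick)
        labels.append(oracle(pick))      # pay for exactly one new label
    return labeled

def fit(X, y):
    """Toy stand-in model: one centroid per class (classes 0 and 1)."""
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def predict_proba(centroids, X):
    d2 = ((X[:, None, :] - centroids[None]) ** 2).sum(axis=-1)
    p = np.exp(-(d2 - d2.min(axis=1, keepdims=True)))
    return p / p.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
y_true = np.array([0, 1] * 50)           # hidden labels, revealed only by the oracle
X_pool = np.where(y_true[:, None] == 0, -1.0, 1.0) + 0.7 * rng.standard_normal((100, 2))
order = uncertainty_sampling(X_pool, lambda i: y_true[i], fit, predict_proba,
                             seed_idx=[0, 1, 2, 3], n_queries=10)
```

Swapping `least_confident` for a margin or entropy criterion changes only the selection line; the loop stays the same.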
The cog-sci frame
This is exactly what an adaptive psychophysics protocol does.
QUEST (Watson & Pelli, 1983), staircase methods, computerized adaptive testing — all are special cases of “choose the next observation to maximize expected information gain”. PML §19.4 is the formal version.
A skilled clinical interviewer does the same without naming it: the questions that buy the most information are the ones whose answers you cannot predict.
Meta-learning, in one slide (PML §19.5)
What if the algorithm itself could be learned from a distribution of tasks?
MAML (Finn et al., 2017) learns an initialization that adapts to new tasks in a few gradient steps. Each “task” is a subject; episodes split into support (adapt) and query (evaluate).
On motor imagery, MAML usually does not beat well-tuned pretraining by much. The mechanism is what matters; the absolute number is not.
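To make the mechanism concrete, here is a toy sketch of MAML on a one-parameter regression problem, where the gradient through the inner adaptation step can be written by hand. It does not use learn2learn or EEG; the task family (lines with slopes drawn around 3), the learning rates, and all function names are assumptions made for illustration.

```python
import numpy as np

def loss_and_grad(w, x, y):
    """MSE loss of the scalar model y_hat = w * x, and its gradient in w."""
    err = w * x - y
    return float(np.mean(err ** 2)), float(2.0 * np.mean(err * x))

def maml_meta_gradient(w, support, query, inner_lr):
    """Exact MAML gradient for one task with a single inner step."""
    x_s, y_s = support
    x_q, y_q = query
    _, g_s = loss_and_grad(w, x_s, y_s)
    w_adapted = w - inner_lr * g_s                     # inner loop: adapt on support
    q_loss, g_q = loss_and_grad(w_adapted, x_q, y_q)   # outer loss: evaluate on query
    hessian_s = 2.0 * np.mean(x_s ** 2)                # second derivative of support loss
    return g_q * (1.0 - inner_lr * hessian_s), q_loss  # chain rule through the inner step

def sample_task(rng, n=10):
    """Hypothetical task family: y = slope * x with slope drawn around 3."""
    slope = rng.normal(3.0, 0.5)
    x = rng.uniform(-1.0, 1.0, size=2 * n)
    y = slope * x
    return (x[:n], y[:n]), (x[n:], y[n:])

rng = np.random.default_rng(0)
w, inner_lr, outer_lr = 0.0, 0.1, 0.05
for step in range(500):
    # Average the meta-gradient over a batch of 8 tasks, then update the init.
    grads = [maml_meta_gradient(w, *sample_task(rng), inner_lr)[0] for _ in range(8)]
    w -= outer_lr * float(np.mean(grads))
```

The learned `w` is not the answer to any single task; it is a starting point from which one gradient step on a new task's support set lands close to that task's slope. In the EEG setting each task would be a subject and the scalar would be the network's weights.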
Cog-sci framing: learning sets
“Learning to learn” comes from Harlow’s 1949 paper on rhesus monkeys.
Monkeys solving many similar discrimination problems got faster at solving new ones, eventually after a single trial. He called it a “learning set”.
MAML is the modern, gradient-friendly formalization. The hierarchical-Bayes framing makes the inheritance explicit: the thing learned across tasks is a prior over weights, which functions as the learning set.
Discussion: smarter strategies
- A colleague hands you a PhD dataset of 30 EEG trials. Which of the day’s strategies do you reach for first? Why?
- Several methods (pseudo-labeling, MAML, augmentation) amplify whatever assumptions the model already had. When is that good, when dangerous?
- Pick one cognitive analog we invoked today and argue that it is misleading rather than helpful.
Smarter-strategy takeaways
- Active learning is adaptive experimental design with a loss function.
- Choosing what to label is often more useful than labeling more.
- Meta-learning is the formal version of “learning to learn”.
- Every section of PML Ch. 19 now has a result on the same EEG dataset.
Wrap
16:50 - 17:00
The day in one table
| PML § | Strategy | Notebook | Cog-sci analog |
|---|---|---|---|
| 19.1 | Augmentation | single | Invariances learned through play |
| 19.2 | Transfer learning | transfer | Adult learning a new language |
| 19.3 | Semi-supervised (pseudo-labels) | single | Skimming after careful reading |
| 19.4 | Active learning | active | Adaptive psychophysics (QUEST) |
| 19.5 | Meta-learning (MAML) | active | Harlow’s learning sets (1949) |
| 19.6 | Few-shot calibration | transfer | Lake et al. one-shot category learning |
| 19.7 | Weakly supervised (smoothing) | single | Trusting noisy authorities, Bayes-style |
Each one is a different way to make a model behave more like a learner that already had a brain before the data arrived. They are re-inventions, in a different vocabulary, of priors.
What to read next
- The PML book (Murphy 2022), Chapter 19.
- Lake, Salakhutdinov, Tenenbaum (2015). “Human-level concept learning through probabilistic program induction.” Science.
- Lake, Ullman, Tenenbaum, Gershman (2017). “Building machines that learn and think like people.” BBS.
- Schirrmeister et al. (2017). “Deep learning with convolutional neural networks for EEG decoding and visualization.” Human Brain Mapping.
- Yamins & DiCarlo (2016). “Using goal-driven deep learning models to understand sensory cortex.” Nature Neuroscience.
Thank you
Slides and notebooks: github.com/ccs-unilu/few2026
Questions, complaints, and existential crises: morteza.ansarinia@uni.lu