Learning with Fewer Labeled Examples
Welcome
Comp Cog Sci, 2026
5 hours, 5 blocks, one EEG dataset, no naps until lunch.
Today’s promise
By 16:00 you will have:
- trained your first deep model on real EEG,
- watched it fail in two interesting ways,
- and learned seven different ways to keep it from failing again.
A short confession
A two-year-old learns to recognize a cat from about eight examples.
A standard deep network needs about eight million.
The two-year-old also needs naps. We will not dwell on this.
What this lecture is really about
The branch of deep learning that has been quietly trying, for the last decade, to close that gap.
The textbook name for this branch is learning with fewer labeled examples. It is the title and the topic of Chapter 19 of Murphy’s Probabilistic Machine Learning. Today is essentially that chapter, in code, on EEG.
The problem in one slide
You have a deep model with a million parameters.
You have 30 labeled examples.
What do you do?
Murphy answers this in seven different ways. We will try all seven on the same dataset, before dinner.
The seven strategies (one per row)
| Section | Strategy | Block |
|---|---|---|
| 19.1 | Make more from what you have (data augmentation) | 2 |
| 19.2 | Borrow representations (transfer learning) | 2, 3 |
| 19.3 | Use the unlabeled data too (semi-supervised) | 3 |
| 19.4 | Choose what to label (active learning) | 5 |
| 19.5 | Learn how to learn (meta-learning) | 5 |
| 19.6 | Generalize from a handful (few-shot) | 4 |
| 19.7 | Tolerate cheap labels (weakly supervised) | 4 |
Every block today maps to at least one row.
How the day is shaped
| Time | What | Ch 19 |
|---|---|---|
| 10:00 - 10:45 | Block 1. Why few labels are hard | (motivation) |
| 10:45 - 12:05 | Block 2. Augmentation + transfer | 19.1, 19.2 |
| 12:05 - 13:15 | Lunch | - |
| 13:15 - 14:15 | Block 3. Embeddings + semi-supervised | 19.2, 19.3 |
| 14:15 - 15:00 | Block 4. Few-shot + noisy labels | 19.6, 19.7 |
| 15:00 - 15:15 | Coffee | - |
| 15:15 - 15:50 | Block 5. Active + meta | 19.4, 19.5 |
| 15:50 - 16:00 | Wrap | - |
Tools we will use
- Google Colab, free tier, no installation.
- MNE-Python for EEG.
- MOABB for loading the dataset cleanly.
- scikit-learn for the simple models.
- Braindecode for the deep models, in a tiny PyTorch wrapper.
- pyriemann for the strong classical baseline.
- sentence-transformers for the text embedding coda in block 3.
- PollEv for occasional pulse checks: PollEv.com/mortyn053
If your Colab catches fire, raise your hand. We have backups.
A note on the audience
This is a course in computational cognitive sciences, not a deep learning course.
I will not derive backpropagation. I will not prove convergence. I will sometimes wave my hands.
In return, you will get to see what these methods actually do, on data that a cognitive scientist actually cares about, and form opinions you can defend at dinner.
Block 1
Why Few Labels Are Hard
10:00 - 10:45
The promise
We will train a small model and watch it fail in two ways:
- when we feed it very few labels,
- when we ask it to generalize across people.
Both failures are versions of the same problem. Both are what the rest of today is for.
The dataset: PhysioNet Motor Imagery
- 109 subjects, 64-channel EEG, 160 Hz.
- Each subject performs imagined movements of the left or right hand.
- One of the most studied datasets in brain-computer interfaces.
- Today we use 10 subjects to keep the downloads polite.
Schalk et al. 2004. Standard MOABB benchmark dataset.
What is motor imagery?
Imagine clenching your left fist. Without moving.
Now your right.
A small but reliable change happens in your sensorimotor cortex: the alpha and beta rhythms over the contralateral hemisphere get a little weaker. This is event-related desynchronization (ERD).
If we can measure that change, we can in principle decode which hand you imagined.
What we will do in the notebook
- Load 10 subjects of left-vs-right imagined hand movement.
- Look at one subject’s data.
- Build a simple classifier on subject 1.
- Failure 1: shrink the labeled set, watch the model collapse.
- Failure 2: train on subject 1, test on subject 2, watch the model collapse differently.
- Discuss why both failures are the same problem in disguise.
The pipeline is intentionally tiny: bandpower features plus logistic regression. No deep learning yet.
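For reference, the whole pipeline fits on a slide. A minimal sketch with random arrays standing in for real epochs (the `bandpower` helper and the array shapes are my assumptions, not the notebook's code; real motor imagery epochs from MOABB would replace `X_raw`):

```python
import numpy as np
from scipy.signal import welch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-in for real epochs: (n_trials, n_channels, n_samples) at 160 Hz.
X_raw = rng.standard_normal((80, 64, 480))
y = rng.integers(0, 2, size=80)  # 0 = left hand, 1 = right hand

def bandpower(epochs, sfreq=160, band=(8, 30)):
    """Mean log power in a frequency band, one feature per channel."""
    freqs, psd = welch(epochs, fs=sfreq, nperseg=sfreq, axis=-1)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return np.log(psd[..., mask].mean(axis=-1))  # (n_trials, n_channels)

X = bandpower(X_raw)  # 64 features per trial: mu/beta power per electrode
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())  # chance on random data; 0.75+ on real motor imagery
```

That is the entire classical baseline: one feature per electrode, one linear model.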
Two predictions
- With all of subject 1’s trials, the classifier hits 75 to 85%.
- With only 10 trials per class, it sits closer to 60%.
- Trained on subject 1, tested on subject 2 with no calibration: 50 to 60%.
Both shortfalls are forms of “not enough labels for what you are trying to do”. The second is the special case where you have zero labels for the person you actually care about.
Open the notebook now
01_single_subject_trap.ipynb
While it downloads (about 5 minutes the first time), keep reading.
A philosophical aside while we wait
Cross-subject generalization is what neuroscience has been trying to do since at least Brodmann.
When a model fails to transfer between two brains, it could be because:
- the brains really are different,
- the model is too fragile,
- or the recording does not align (different electrode positions, different impedances, different days).
Most of the time it is some mixture of all three. The fact that we cannot tell them apart is itself a finding.
Discussion (after notebook)
Pick one:
- If your model gets 80% within-subject and 55% across, is that a useful BCI?
- Would you rather collect more subjects, or more trials per subject?
- For a clinical application, where would you draw the minimum acceptable accuracy?
(Poll on screen.)
Block 1 takeaways
- Even small models work if you have enough labels for the right person.
- Take the labels away (failure 1) or change the person (failure 2) and the model collapses.
- Both failures are forms of the few-labels problem. Chapter 19 is seven different fixes for it. We start with two of them after the break.
Block 2
Data Augmentation and Transfer Learning
10:45 - 12:05
Sections 19.1 and 19.2
Two strategies, one block
Augmentation (19.1): when you do not have enough labels, make more from the ones you have. Random transformations that should not change the label.
Transfer learning (19.2): when you do not have enough labels, borrow parameters from a model that was trained on someone with more.
Both are doing the same thing in different vocabularies: injecting a prior into the learning problem.
Why augmentation works
A trial of motor imagery is a sample from a much larger family of acceptable trials. If you shift the time axis by 50 ms, the label is still “left hand”. If you drop a random electrode, the label is still “left hand”.
Augmentation is the act of telling the model “these things should not matter”. It is a way of encoding invariances without writing them down in the model.
The cognitive analog: this is what happens when a child sees a cat in different lighting and concludes it is still a cat. The label is invariant to the lighting transformation.
Augmentations we will try
- Time shift: move the trial forward or backward in time by a few samples.
- Channel dropout: zero out a few random electrodes.
- Gaussian noise: add a small amount of noise.
- Mixup: average two trials of the same class.
All four are one-liners in numpy. We will run a learning-curve experiment: with very few labels, does augmentation actually help?
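They really are near one-liners. A sketch on a random stand-in trial (function names and default parameters are mine; tune `max_shift`, `p`, and `sigma` to your sampling rate and noise floor):

```python
import numpy as np

rng = np.random.default_rng(0)

def time_shift(x, max_shift=8):
    """Roll the trial along the time axis by a random number of samples."""
    return np.roll(x, rng.integers(-max_shift, max_shift + 1), axis=-1)

def channel_dropout(x, p=0.1):
    """Zero out each electrode independently with probability p."""
    keep = rng.random(x.shape[0]) >= p
    return x * keep[:, None]

def gaussian_noise(x, sigma=0.1):
    """Add small i.i.d. noise, scaled to the trial's overall std."""
    return x + sigma * x.std() * rng.standard_normal(x.shape)

def mixup_same_class(x1, x2, alpha=0.2):
    """Convex combination of two trials that share a label."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2

trial = rng.standard_normal((64, 480))  # (channels, samples)
augmented = gaussian_noise(channel_dropout(time_shift(trial)))
```

Each function leaves the label untouched; that is the whole point.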
Why transfer learning works
A pretrained model is a model that has already seen a lot of brains.
The features it has learned (filters that pick out beta-band desynchronization over motor cortex, for instance) are not specific to the subject it was trained on. They are properties of the human sensorimotor system.
Fine-tuning is the act of saying “keep most of what you learned, just nudge the last layer to fit this new person”.
The cognitive analog: this is what every adult does when they learn a new language. They do not relearn how to hear formants from scratch.
The model: EEGNet
A small convolutional network designed for EEG.
- 4 layers, about 2,000 parameters.
- Designed by Lawhern et al. (2018).
- Standard reference architecture in the BCI literature.
- Trains in seconds on a Colab T4.
We will use the implementation from braindecode, which wraps it in a scikit-learn-compatible interface.
Three regimes to compare
For a held-out target subject:
| Regime | Pretrain | Fine-tune | What it tests |
|---|---|---|---|
| From scratch | none | only on target | Lower bound |
| Zero-shot transfer | on N-1 subjects | none | How much generalizes for free |
| Fine-tuned transfer | on N-1 subjects | on a few target trials | Practical BCI calibration |
Plus all of the above, with and without augmentation, on small training sets. That is the full experimental matrix.
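The fine-tuned-transfer row boils down to three moves: load pretrained weights, freeze the feature layers, update only the head. A PyTorch sketch with a toy two-layer network standing in for EEGNet (the notebook uses braindecode's implementation and a real checkpoint; everything here is a stand-in):

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for a pretrained EEGNet: feature extractor + classification head.
# In the notebook, this would be loaded from a checkpoint pretrained on the
# N-1 source subjects, not freshly initialized.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(64 * 480, 32), nn.ReLU())
head = nn.Linear(32, 2)
model = nn.Sequential(backbone, head)

for p in backbone.parameters():          # freeze everything but the head
    p.requires_grad = False

opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# A few calibration trials from the target subject (random stand-ins here).
X = torch.randn(20, 64, 480)
y = torch.randint(0, 2, (20,))

for _ in range(5):                       # short fine-tuning loop
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
```

Unfreezing more layers trades stability for flexibility; with 20 trials, head-only is usually the safe bet.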
What you should see
- From scratch with 20 trials: roughly 55%.
- Augmentation alone with 20 trials: roughly 60%.
- Zero-shot transfer: roughly 55% (the cross-subject gap from block 1).
- Fine-tuned transfer with 20 trials: roughly 70%.
- Fine-tuned transfer with augmentation, 20 trials: roughly 75%.
Each fix adds a few points. They stack. Most days in BCI research are spent chasing those few points.
Open 02_transfer.ipynb
While it loads:
- think about which augmentations make biological sense for EEG and which do not.
- think about whether “pretrained on other humans” is more or less powerful than “pretrained on natural images”.
Block 2 takeaways
- Augmentation is the cheapest fix and you should always try it first.
- Transfer learning is the second cheapest fix and is now the default in BCI.
- Stacking the two often beats either alone by a margin that matters.
- None of these are deep learning tricks. They are statements about what is invariant in the data.
Block 3
Embeddings as Priors, and Letting the Unlabeled Data Help
13:15 - 14:15
Sections 19.2 (representation learning) and 19.3 (semi-supervised)
The bigger version of transfer
In block 2 we transferred a model from one set of brains to another.
In this block we transfer a representation. We use a model that already knows what good EEG features look like, freeze it, and only learn the easy part on top.
The frozen model is the prior. The classifier on top is the posterior update. (You will recognize this from Bayesian cognitive science.)
Three feature extractors
| Features | Era | Cost |
|---|---|---|
| Bandpower (block 1) | 1990s | Trivial |
| Riemannian tangent space | 2010s | Cheap |
| Pretrained EEGNet penultimate layer | 2020s | One forward pass |
We feed each into the same logistic regression. We see which one wins on small labeled sets.
The punchline, often: Riemannian features from 2012 are still close to state of the art on this dataset. Geometry beats representation, sometimes.
Semi-supervised learning (19.3)
So far we have ignored a lot of free data.
Each subject in PhysioNet has runs of EEG that are not in the train set. We have their labels, but we could pretend we did not. That gives us a semi-supervised problem: a small labeled set and a large unlabeled set, both from the same distribution.
Murphy 19.3 lists several strategies. We will try the simplest: pseudo-labeling.
Pseudo-labeling, in three lines
- Train a classifier on the labeled set.
- Predict labels on the unlabeled set. Keep the high-confidence predictions.
- Add those to the training set. Retrain.
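The three steps above, literally, on a synthetic two-blob problem (the data and the 0.9 threshold are stand-ins; the notebook uses the EEG features from earlier):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: two Gaussian blobs, a few labeled, many unlabeled.
X = np.vstack([rng.normal(-1, 1, (100, 8)), rng.normal(1, 1, (100, 8))])
y = np.repeat([0, 1], 100)
labeled = np.concatenate(
    [rng.choice(np.where(y == c)[0], 5, replace=False) for c in (0, 1)])
unlabeled = np.setdiff1d(np.arange(200), labeled)

# Step 1: train on the labeled set.
clf = LogisticRegression().fit(X[labeled], y[labeled])

# Step 2: predict on the unlabeled pool; keep only confident predictions.
proba = clf.predict_proba(X[unlabeled])
confident = proba.max(axis=1) > 0.9        # the threshold that matters
pseudo_y = proba.argmax(axis=1)[confident]

# Step 3: retrain on labeled + pseudo-labeled examples.
X_new = np.vstack([X[labeled], X[unlabeled][confident]])
y_new = np.concatenate([y[labeled], pseudo_y])
clf2 = LogisticRegression().fit(X_new, y_new)
```

Everything interesting hides in that `> 0.9`: too low and you amplify early mistakes, too high and the unlabeled pool contributes nothing.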
Cognitive analog: this is what every grad student does in their first literature review. You read a few papers carefully (labels), then skim many more (unlabeled), trust your judgment about the easy ones, and keep going.
If your judgment is good enough, this works. If it is not, it amplifies your initial bias. Both outcomes appear in the literature.
What you should see
A learning curve with two lines:
- Supervised only.
- Supervised plus pseudo-labels from the unlabeled pool.
The second line should be above the first when labels are very few, and should converge as labels grow. That is the canonical signature of a working semi-supervised method.
If it does not work in your run, try a different confidence threshold. The sensitivity to that threshold is itself a teaching moment.
A 10-minute coda: the same trick on text
A short detour to show that “frozen encoder plus simple classifier” is not EEG-specific.
We embed about 200 short text snippets with nomic-ai/nomic-embed-text-v1.5, fit a 1-NN classifier on top, and watch it work with very few labels.
This is a 2024+ open-weight LLM embedding, fully free, runs on T4 in seconds. The recipe is identical to what we just did with EEG.
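The recipe, stripped to its skeleton (random vectors stand in for the real embeddings here; in the notebook they come from `SentenceTransformer("nomic-ai/nomic-embed-text-v1.5").encode(snippets)`):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-in for 200 frozen sentence embeddings, two topics.
emb = np.vstack([rng.normal(m, 1, (100, 32)) for m in (-0.5, 0.5)])
labels = np.repeat([0, 1], 100)

knn = KNeighborsClassifier(n_neighbors=1)   # the entire "classifier"
score = cross_val_score(knn, emb, labels, cv=5).mean()
print(score)
```

Frozen encoder, 1-NN on top, done. No gradient ever touches the encoder.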
Block 3 takeaways
- A frozen pretrained encoder plus a simple classifier is the strongest first move in 2026.
- Geometry-aware features (Riemannian) are still hard to beat on small EEG data.
- Pseudo-labeling lets unlabeled data pull weight, but it amplifies whatever bias you already had.
- The recipe transfers across modalities. The same idea with nomic-embed works on text.
Block 4
Few-Shot Learning and Tolerating Noisy Labels
14:15 - 15:00
Sections 19.6 and 19.7
Few-shot calibration (19.6)
The chronic problem of BCI: every new user needs a 20-minute calibration session.
What if we could shrink it to one minute?
That is the few-shot learning problem on neural data. Take a model pretrained on N-1 subjects, give it 1, 2, 5, 10, 20 trials per class from a new subject, and plot the curve.
What you should see
A monotonically rising curve with a steep slope at the bottom.
- 1 trial per class: 55%
- 5 trials: 65%
- 20 trials: 75%
- 50 trials: 80%
The shape of that curve tells you how much calibration is “enough” for your application. For a research demo: 5 trials. For a wheelchair: 50 and counting.
Weakly supervised learning (19.7)
Real labels are messy.
In motor imagery, we cannot actually verify that the subject imagined what they were told to imagine. Some trials are noisy.
In clinical EEG, the gold standard is interrater agreement, which is often 80% at best.
In behavioral data, participants make errors that look like correct trials.
What do you do when your labels are unreliable?
What we will try
We deliberately corrupt 20% of the training labels at random. Then we train two classifiers:
- Standard cross-entropy loss.
- Cross-entropy with label smoothing (Murphy 19.7.1).
Label smoothing replaces “this is class 1, certainty 100%” with “this is class 1, certainty 90%, class 2 certainty 10%”. It tells the model not to be too confident, which makes it more robust to noisy labels.
We compare the two on a held-out clean test set.
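In PyTorch, label smoothing is a single argument to the loss. A sketch of the corrupt-and-compare experiment on a toy linear problem (the toy data and the 20% flip rate are stand-ins; expect a bigger gap on real EEG than on this toy):

```python
import torch
from torch import nn

torch.manual_seed(0)

n = 200
X = torch.randn(n, 16)
y = (X[:, 0] > 0).long()                 # a clean, learnable rule

# Deliberately corrupt 20% of the training labels at random.
flip = torch.rand(n) < 0.2
y_noisy = torch.where(flip, 1 - y, y)

def train(loss_fn, steps=200):
    model = nn.Linear(16, 2)
    opt = torch.optim.Adam(model.parameters(), lr=0.05)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(X), y_noisy).backward()
        opt.step()
    return model

plain = train(nn.CrossEntropyLoss())
smoothed = train(nn.CrossEntropyLoss(label_smoothing=0.1))

# Evaluate both against the clean labels.
acc = lambda m: (m(X).argmax(1) == y).float().mean().item()
print(acc(plain), acc(smoothed))
```

`label_smoothing=0.1` is exactly the "90% certain" target described above: the model is never rewarded for reaching 100% confidence on any single label.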
What you should see
The label-smoothed model loses a point or two on the clean baseline (no noise) but gains several points when noise is present.
This is the canonical bias-variance trade-off you make when your labels are not trustworthy.
Cognitive analog: this is the formal version of “I trust the textbook but not 100%”. Bayesians have been doing this for decades. Deep learning caught up around 2016.
Block 4 takeaways
- Few-shot calibration is the most practical version of the few-labels problem in BCI.
- Pretraining transforms calibration from “20 minutes” to “1 minute” without changing accuracy much.
- When labels are noisy, never let your model be 100% confident. Label smoothing is the cheapest form of regularization that buys you robustness.
Block 5
Active Learning and Meta-Learning
15:15 - 15:50
Sections 19.4 and 19.5
Active learning (19.4)
You have an unlabeled pool. Labeling is expensive (in BCI: a calibration trial). Which examples should you label first?
Murphy 19.4: pick the ones the model is most uncertain about.
We watch the curve climb faster with uncertainty sampling than with random sampling. Hands-on but light.
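The query loop fits in a dozen lines. A sketch on a synthetic pool, with the true labels playing the oracle (in a real BCI, the "oracle" is another calibration trial):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Unlabeled pool (synthetic stand-in), plus a small stratified seed set.
X = np.vstack([rng.normal(-0.5, 1, (150, 8)), rng.normal(0.5, 1, (150, 8))])
y = np.repeat([0, 1], 150)
labeled = list(np.concatenate(
    [rng.choice(np.where(y == c)[0], 3, replace=False) for c in (0, 1)]))

for _ in range(10):                           # ten labeling rounds
    clf = LogisticRegression().fit(X[labeled], y[labeled])
    pool = np.setdiff1d(np.arange(300), labeled)
    proba = clf.predict_proba(X[pool])
    # Uncertainty sampling: query the point closest to the decision boundary.
    query = pool[np.abs(proba[:, 1] - 0.5).argmin()]
    labeled.append(query)                     # "ask the oracle" for its label
```

Swap the `argmin` line for `rng.choice(pool)` and you have the random-sampling baseline; the gap between the two learning curves is the whole argument of 19.4.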
The cog-sci frame
This is exactly what an adaptive psychophysics protocol does.
QUEST and staircase methods are active learning by another name. The formalism just makes the connection explicit.
A skilled clinical interviewer also does this without thinking about it. The questions that buy you the most information are the ones whose answers you cannot predict in advance.
Meta-learning, in one slide (19.5)
What if the algorithm itself could be learned from a distribution of tasks?
That is meta-learning. The most-cited example is MAML (Finn et al. 2017), which learns an initialization that adapts to new tasks in a few gradient steps.
We will run a tiny MAML demo on episodes built from PhysionetMI subjects (each subject is a “task”). It will not beat fine-tuning by much, but the mechanism is worth seeing once.
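For the curious, here is the MAML mechanism in miniature: synthetic tasks, a linear model, one inner gradient step, second-order outer update. This is my toy stand-in, not the notebook's episode construction from PhysionetMI subjects:

```python
import torch

torch.manual_seed(0)

# Each "task" is a binary problem whose decision direction differs per task,
# standing in for per-subject episodes.
def sample_task(n=20):
    w_true = torch.randn(4)
    X = torch.randn(2 * n, 4)
    y = ((X @ w_true) > 0).float()
    return X[:n], y[:n], X[n:], y[n:]      # support / query split

w = torch.zeros(4, requires_grad=True)     # the meta-learned initialization
meta_opt = torch.optim.Adam([w], lr=0.05)
bce = torch.nn.functional.binary_cross_entropy_with_logits

for step in range(100):
    meta_opt.zero_grad()
    meta_loss = 0.0
    for _ in range(4):                     # a meta-batch of tasks
        Xs, ys, Xq, yq = sample_task()
        # Inner loop: one gradient step on the support set,
        # keeping the graph so the outer update can differentiate through it.
        inner = bce(Xs @ w, ys)
        (g,) = torch.autograd.grad(inner, w, create_graph=True)
        w_adapted = w - 0.5 * g
        # Outer loss: how well the *adapted* weights do on the query set.
        meta_loss = meta_loss + bce(Xq @ w_adapted, yq)
    meta_loss.backward()
    meta_opt.step()
```

The `create_graph=True` is the whole trick: the meta-gradient flows through the inner adaptation step, so the initialization is optimized for how well it adapts, not for how well it performs as-is.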
Block 5 takeaways
- Active learning is just adaptive experimental design with a loss function.
- Choosing what to label is often more useful than labeling more.
- Meta-learning is the formal version of “learning to learn”, which cog sci has been studying since Harlow’s monkeys.
Wrap
15:50 - 16:00
What we did, in one table
| Block | Strategy | Murphy section | Cog-sci analog |
|---|---|---|---|
| 1 | (problem statement) | - | The data-poor child |
| 2 | Augmentation | 19.1 | Invariances learned through play |
| 2 | Transfer learning | 19.2 | Adult learning a new language |
| 3 | Pretrained representations | 19.2 | Perception built on prior experience |
| 3 | Semi-supervised | 19.3 | Skimming after careful reading |
| 4 | Few-shot | 19.6 | One-shot category learning (Lake et al.) |
| 4 | Weakly supervised | 19.7 | Trusting noisy authorities |
| 5 | Active learning | 19.4 | Adaptive psychophysics |
| 5 | Meta-learning | 19.5 | Harlow’s learning sets |
What to read next
- Murphy, Probabilistic Machine Learning: An Introduction, Chapter 19.
- Lake, Salakhutdinov, Tenenbaum (2015). “Human-level concept learning through probabilistic program induction.” Science.
- Lake, Ullman, Tenenbaum, Gershman (2017). “Building machines that learn and think like people.” BBS.
- Schirrmeister et al. (2017). “Deep learning with convolutional neural networks for EEG decoding and visualization.” Human Brain Mapping.
- Yamins & DiCarlo (2016). “Using goal-driven deep learning models to understand sensory cortex.” Nature Neuroscience.
For the productivity-curious: Anne-Laure Le Cunff, Tiny Experiments. Different topic, same energy.
Thank you
Slides and notebooks: github.com/ccs-unilu/few-labels-2026
Questions, complaints, and existential crises: morteza.ansarinia@uni.lu