Learning with Fewer Labeled Examples
Welcome
Comp Cog Sci, 2026
5 hours, 5 blocks, one EEG dataset, no naps until lunch.
Today’s promise
By 16:00 you will have:
- trained your first deep model on real EEG,
- watched it fail in two interesting ways,
- and learned seven different ways to keep it from failing again.
A short confession
A two-year-old learns to recognize a cat from about eight examples.
A standard deep network needs about eight million.
The two-year-old also needs naps. We will not dwell on this.
What this lecture is really about
The branch of deep learning that has been quietly trying, for the last decade, to close that gap.
The textbook name for this branch is learning with fewer labeled examples. It is the title and the topic of Chapter 19 of Murphy’s Probabilistic Machine Learning. Today is essentially that chapter, in code, on EEG.
The problem in one slide
You have a deep model with a million parameters.
You have 30 labeled examples.
What do you do?
Murphy answers this in seven different ways. We will try all seven on the same dataset, before dinner.
The seven strategies (one per row)
| Section | Strategy | Block |
|---|---|---|
| 19.1 | Make more from what you have (data augmentation) | 2 |
| 19.2 | Borrow representations (transfer learning) | 2, 3 |
| 19.3 | Use the unlabeled data too (semi-supervised) | 3 |
| 19.4 | Choose what to label (active learning) | 5 |
| 19.5 | Learn how to learn (meta-learning) | 5 |
| 19.6 | Generalize from a handful (few-shot) | 4 |
| 19.7 | Tolerate cheap labels (weakly supervised) | 4 |
Every block today maps to at least one row.
How the day is shaped
| Time | What | Ch 19 |
|---|---|---|
| 10:00 - 10:45 | Block 1. Why few labels are hard | (motivation) |
| 10:45 - 12:05 | Block 2. Augmentation + transfer | 19.1, 19.2 |
| 12:05 - 13:15 | Lunch | - |
| 13:15 - 14:15 | Block 3. Embeddings + semi-supervised | 19.2, 19.3 |
| 14:15 - 15:00 | Block 4. Few-shot + noisy labels | 19.6, 19.7 |
| 15:00 - 15:15 | Coffee | - |
| 15:15 - 15:50 | Block 5. Active + meta | 19.4, 19.5 |
| 15:50 - 16:00 | Wrap | - |
Tools we will use
- Google Colab, free tier, no installation.
- MNE-Python for EEG.
- MOABB for loading the dataset cleanly.
- scikit-learn for the simple models.
- Braindecode for the deep models, in a tiny PyTorch wrapper.
- pyriemann for the strong classical baseline.
- sentence-transformers for the text embedding coda in block 3.
- PollEv for occasional pulse checks: PollEv.com/mortyn053
If your Colab catches fire, raise your hand. We have backups.
A note on the audience
This is a course in computational cognitive sciences, not a deep learning course.
I will not derive backpropagation. I will not prove convergence. I will sometimes wave my hands.
In return, you will get to see what these methods actually do, on data that a cognitive scientist actually cares about, and form opinions you can defend at dinner.
Block 1
Why Few Labels Are Hard
10:00 - 10:45
The promise
We will train a small model and watch it fail in two ways:
- when we feed it very few labels,
- when we ask it to generalize across people.
Both failures are versions of the same problem. Both are what the rest of today is for.
The dataset: PhysioNet Motor Imagery
- 109 subjects, 64-channel EEG, 160 Hz.
- Each subject performs imagined movements of the left or right hand.
- One of the most studied datasets in brain-computer interfaces.
- Today we use 10 subjects to keep the downloads polite.
Schalk et al. 2004. Standard MOABB benchmark dataset.
What is motor imagery?
Imagine clenching your left fist. Without moving.
Now your right.
A small but reliable change happens in your sensorimotor cortex: the alpha and beta rhythms over the contralateral hemisphere get a little weaker. This is event-related desynchronization (ERD).
If we can measure that change, we can in principle decode which hand you imagined.
What we will do in the notebook
- Load 10 subjects of left-vs-right imagined hand movement.
- Look at one subject’s data.
- Build a simple classifier on subject 1.
- Failure 1: shrink the labeled set, watch the model collapse.
- Failure 2: train on subject 1, test on subject 2, watch the model collapse differently.
- Discuss why both failures are the same problem in disguise.
The pipeline is intentionally tiny: bandpower features plus logistic regression. No deep learning yet.
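For reference, the whole pipeline fits on a slide. A minimal sketch with random arrays standing in for real epochs (the `bandpower` helper and the array shapes are my assumptions, not the notebook's code; real motor imagery epochs from MOABB would replace `X_raw`):

```python
import numpy as np
from scipy.signal import welch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-in for real epochs: (n_trials, n_channels, n_samples) at 160 Hz.
X_raw = rng.standard_normal((80, 64, 480))
y = rng.integers(0, 2, size=80)  # 0 = left hand, 1 = right hand

def bandpower(epochs, sfreq=160, band=(8, 30)):
    """Mean log power in a frequency band, one feature per channel."""
    freqs, psd = welch(epochs, fs=sfreq, nperseg=sfreq, axis=-1)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return np.log(psd[..., mask].mean(axis=-1))  # (n_trials, n_channels)

X = bandpower(X_raw)  # 64 features per trial: mu/beta power per electrode
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())  # chance on random data; 0.75+ on real motor imagery
```

That is the entire classical baseline: one feature per electrode, one linear model.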
Two predictions
- With all of subject 1’s trials, the classifier hits 75 to 85%.
- With only 10 trials per class, it sits closer to 60%.
- Trained on subject 1, tested on subject 2 with no calibration: 50 to 60%.
Both shortfalls are forms of “not enough labels for what you are trying to do”. The second is the special case where you have zero labels for the person you actually care about.
Open the notebook now
01_single_subject_trap.ipynb
While it downloads (about 5 minutes the first time), keep reading.
A philosophical aside while we wait
Cross-subject generalization is what neuroscience has been trying to do since at least Brodmann.
When a model fails to transfer between two brains, it could be because:
- the brains really are different,
- the model is too fragile,
- or the recording does not align (different electrode positions, different impedances, different days).
Most of the time it is some mixture of all three. The fact that we cannot tell them apart is itself a finding.
Discussion (after notebook)
Pick one:
- If your model gets 80% within-subject and 55% across, is that a useful BCI?
- Would you rather collect more subjects, or more trials per subject?
- For a clinical application, where would you draw the minimum acceptable accuracy?
(Poll on screen.)
Block 1 takeaways
- Even small models work if you have enough labels for the right person.
- Take the labels away (failure 1) or change the person (failure 2) and the model collapses.
- Both failures are forms of the few-labels problem. Chapter 19 is seven different fixes for it. We start with two of them after the break.
Block 2
Data Augmentation and Transfer Learning
10:45 - 12:05
Sections 19.1 and 19.2
Two strategies, one block
Augmentation (19.1): when you do not have enough labels, make more from the ones you have. Random transformations that should not change the label.
Transfer learning (19.2): when you do not have enough labels, borrow parameters from a model that was trained on someone with more.
Both are doing the same thing in different vocabularies: injecting a prior into the learning problem.
Why augmentation works
A trial of motor imagery is a sample from a much larger family of acceptable trials. If you shift the time axis by 50 ms, the label is still “left hand”. If you drop a random electrode, the label is still “left hand”.
Augmentation is the act of telling the model “these things should not matter”. It is a way of encoding invariances without writing them down in the model.
The cognitive analog: this is what happens when a child sees a cat in different lighting and concludes it is still a cat. The label is invariant to the lighting transformation.
Augmentations we will try
- Time shift: move the trial forward or backward in time by a few samples.
- Channel dropout: zero out a few random electrodes.
- Gaussian noise: add a small amount of noise.
- Mixup: average two trials of the same class.
All four are one-liners in numpy. We will run a learning-curve experiment: with very few labels, does augmentation actually help?
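They really are near one-liners. A sketch on a random stand-in trial (function names and default parameters are mine; tune `max_shift`, `p`, and `sigma` to your sampling rate and noise floor):

```python
import numpy as np

rng = np.random.default_rng(0)

def time_shift(x, max_shift=8):
    """Roll the trial along the time axis by a random number of samples."""
    return np.roll(x, rng.integers(-max_shift, max_shift + 1), axis=-1)

def channel_dropout(x, p=0.1):
    """Zero out each electrode independently with probability p."""
    keep = rng.random(x.shape[0]) >= p
    return x * keep[:, None]

def gaussian_noise(x, sigma=0.1):
    """Add small i.i.d. noise, scaled to the trial's overall std."""
    return x + sigma * x.std() * rng.standard_normal(x.shape)

def mixup_same_class(x1, x2, alpha=0.2):
    """Convex combination of two trials that share a label."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2

trial = rng.standard_normal((64, 480))  # (channels, samples)
augmented = gaussian_noise(channel_dropout(time_shift(trial)))
```

Each function leaves the label untouched; that is the whole point.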
Why transfer learning works
A pretrained model is a model that has already seen a lot of brains.
The features it has learned (filters that pick out beta-band desynchronization over motor cortex, for instance) are not specific to the subject it was trained on. They are properties of the human sensorimotor system.
Fine-tuning is the act of saying “keep most of what you learned, just nudge the last layer to fit this new person”.
The cognitive analog: this is what every adult does when they learn a new language. They do not relearn how to hear formants from scratch.
The model: EEGNet
A small convolutional network designed for EEG.
- 4 layers, about 2,000 parameters.
- Designed by Lawhern et al. (2018).
- Standard reference architecture in the BCI literature.
- Trains in seconds on a Colab T4.
We will use the implementation from braindecode, which wraps it in a scikit-learn-compatible interface.
Three regimes to compare
For a held-out target subject:
| Regime | Pretrain | Fine-tune | What it tests |
|---|---|---|---|
| From scratch | none | only on target | Lower bound |
| Zero-shot transfer | on N-1 subjects | none | How much generalizes for free |
| Fine-tuned transfer | on N-1 subjects | on a few target trials | Practical BCI calibration |
Plus all of the above, with and without augmentation, on small training sets. That is the full experimental matrix.
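The fine-tuned-transfer row boils down to three moves: load pretrained weights, freeze the feature layers, update only the head. A PyTorch sketch with a toy two-layer network standing in for EEGNet (the notebook uses braindecode's implementation and a real checkpoint; everything here is a stand-in):

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for a pretrained EEGNet: feature extractor + classification head.
# In the notebook, this would be loaded from a checkpoint pretrained on the
# N-1 source subjects, not freshly initialized.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(64 * 480, 32), nn.ReLU())
head = nn.Linear(32, 2)
model = nn.Sequential(backbone, head)

for p in backbone.parameters():          # freeze everything but the head
    p.requires_grad = False

opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# A few calibration trials from the target subject (random stand-ins here).
X = torch.randn(20, 64, 480)
y = torch.randint(0, 2, (20,))

for _ in range(5):                       # short fine-tuning loop
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
```

Unfreezing more layers trades stability for flexibility; with 20 trials, head-only is usually the safe bet.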
What you should see
- From scratch with 20 trials: roughly 55%.
- Augmentation alone with 20 trials: roughly 60%.
- Zero-shot transfer: roughly 55% (the cross-subject gap from block 1).
- Fine-tuned transfer with 20 trials: roughly 70%.
- Fine-tuned transfer with augmentation, 20 trials: roughly 75%.
Each fix adds a few points. They stack. Most days in BCI research are spent chasing those few points.
Open 02_transfer.ipynb
While it loads:
- think about which augmentations make biological sense for EEG and which do not.
- think about whether “pretrained on other humans” is more or less powerful than “pretrained on natural images”.
Block 2 takeaways
- Augmentation is the cheapest fix and you should always try it first.
- Transfer learning is the second cheapest fix and is now the default in BCI.
- Stacking the two often beats either alone by a margin that matters.
- None of these are deep learning tricks. They are statements about what is invariant in the data.
Block 3
Embeddings as Priors, and Letting the Unlabeled Data Help
13:15 - 14:15
Sections 19.2 (representation learning) and 19.3 (semi-supervised)
The bigger version of transfer
In block 2 we transferred a model from one set of brains to another.
In this block we transfer a representation. We use a model that already knows what good EEG features look like, freeze it, and only learn the easy part on top.
The frozen model is the prior. The classifier on top is the posterior update. (You will recognize this from Bayesian cognitive science.)
Three feature extractors
| Features | Era | Cost |
|---|---|---|
| Bandpower (block 1) | 1990s | Trivial |
| Riemannian tangent space | 2010s | Cheap |
| Pretrained EEGNet penultimate layer | 2020s | One forward pass |
We feed each into the same logistic regression. We see which one wins on small labeled sets.
The punchline, often: Riemannian features from 2012 are still close to state of the art on this dataset. Geometry beats representation, sometimes.
Semi-supervised learning (19.3)
So far we have ignored a lot of free data.
Each subject in PhysioNet has runs of EEG that are not in the train set. We have their labels, but we could pretend we did not. That gives us a semi-supervised problem: a small labeled set and a large unlabeled set, both from the same distribution.
Murphy 19.3 lists several strategies. We will try the simplest: pseudo-labeling.
Pseudo-labeling, in three lines
- Train a classifier on the labeled set.
- Predict labels on the unlabeled set. Keep the high-confidence predictions.
- Add those to the training set. Retrain.
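The three steps above, literally, on a synthetic two-blob problem (the data and the 0.9 threshold are stand-ins; the notebook uses the EEG features from earlier):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: two Gaussian blobs, a few labeled, many unlabeled.
X = np.vstack([rng.normal(-1, 1, (100, 8)), rng.normal(1, 1, (100, 8))])
y = np.repeat([0, 1], 100)
labeled = np.concatenate(
    [rng.choice(np.where(y == c)[0], 5, replace=False) for c in (0, 1)])
unlabeled = np.setdiff1d(np.arange(200), labeled)

# Step 1: train on the labeled set.
clf = LogisticRegression().fit(X[labeled], y[labeled])

# Step 2: predict on the unlabeled pool; keep only confident predictions.
proba = clf.predict_proba(X[unlabeled])
confident = proba.max(axis=1) > 0.9        # the threshold that matters
pseudo_y = proba.argmax(axis=1)[confident]

# Step 3: retrain on labeled + pseudo-labeled examples.
X_new = np.vstack([X[labeled], X[unlabeled][confident]])
y_new = np.concatenate([y[labeled], pseudo_y])
clf2 = LogisticRegression().fit(X_new, y_new)
```

Everything interesting hides in that `> 0.9`: too low and you amplify early mistakes, too high and the unlabeled pool contributes nothing.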
Cognitive analog: this is what every grad student does in their first literature review. You read a few papers carefully (labels), then skim many more (unlabeled), trust your judgment about the easy ones, and keep going.
If your judgment is good enough, this works. If it is not, it amplifies your initial bias. Both outcomes appear in the literature.
What you should see
A learning curve with two lines:
- Supervised only.
- Supervised plus pseudo-labels from the unlabeled pool.
The second line should be above the first when labels are very few, and should converge as labels grow. That is the canonical signature of a working semi-supervised method.
If it does not work in your run, try a different confidence threshold. The sensitivity to that threshold is itself a teaching moment.
A 10-minute coda: the same trick on text
A short detour to show that “frozen encoder plus simple classifier” is not EEG-specific.
We embed about 200 short text snippets with nomic-ai/nomic-embed-text-v1.5, fit a 1-NN classifier on top, and watch it work with very few labels.
This is a 2024+ open-weight LLM embedding, fully free, runs on T4 in seconds. The recipe is identical to what we just did with EEG.
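The recipe, stripped to its skeleton (random vectors stand in for the real embeddings here; in the notebook they come from `SentenceTransformer("nomic-ai/nomic-embed-text-v1.5").encode(snippets)`):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-in for 200 frozen sentence embeddings, two topics.
emb = np.vstack([rng.normal(m, 1, (100, 32)) for m in (-0.5, 0.5)])
labels = np.repeat([0, 1], 100)

knn = KNeighborsClassifier(n_neighbors=1)   # the entire "classifier"
score = cross_val_score(knn, emb, labels, cv=5).mean()
print(score)
```

Frozen encoder, 1-NN on top, done. No gradient ever touches the encoder.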
Block 3 takeaways
- A frozen pretrained encoder plus a simple classifier is the strongest first move in 2026.
- Geometry-aware features (Riemannian) are still hard to beat on small EEG data.
- Pseudo-labeling lets unlabeled data pull weight, but it amplifies whatever bias you already had.
- The recipe transfers across modalities. The same idea with nomic-embed works on text.
Block 4
Few-Shot Learning and Tolerating Noisy Labels
14:15 - 15:00
Sections 19.6 and 19.7
Few-shot calibration (19.6)
The chronic problem of BCI: every new user needs a 20-minute calibration session.
What if we could shrink it to one minute?
That is the few-shot learning problem on neural data. Take a model pretrained on N-1 subjects, give it 1, 2, 5, 10, 20 trials per class from a new subject, and plot the curve.
What you should see
A monotonically rising curve with a steep slope at the bottom.
- 1 trial per class: 55%
- 5 trials: 65%
- 20 trials: 75%
- 50 trials: 80%
The shape of that curve tells you how much calibration is “enough” for your application. For a research demo: 5 trials. For a wheelchair: 50 and counting.
Weakly supervised learning (19.7)
Real labels are messy.
In motor imagery, we cannot actually verify that the subject imagined what they were told to imagine. Some trials are noisy.
In clinical EEG, the gold standard is interrater agreement, which is often 80% at best.
In behavioral data, participants make errors that look like correct trials.
What do you do when your labels are unreliable?
What we will try
We deliberately corrupt 20% of the training labels at random. Then we train two classifiers:
- Standard cross-entropy loss.
- Cross-entropy with label smoothing (Murphy 19.7.1).
Label smoothing replaces “this is class 1, certainty 100%” with “this is class 1, certainty 90%, class 2 certainty 10%”. It tells the model not to be too confident, which makes it more robust to noisy labels.
We compare the two on a held-out clean test set.
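In PyTorch, label smoothing is a single argument to the loss. A sketch of the corrupt-and-compare experiment on a toy linear problem (the toy data and the 20% flip rate are stand-ins; expect a bigger gap on real EEG than on this toy):

```python
import torch
from torch import nn

torch.manual_seed(0)

n = 200
X = torch.randn(n, 16)
y = (X[:, 0] > 0).long()                 # a clean, learnable rule

# Deliberately corrupt 20% of the training labels at random.
flip = torch.rand(n) < 0.2
y_noisy = torch.where(flip, 1 - y, y)

def train(loss_fn, steps=200):
    model = nn.Linear(16, 2)
    opt = torch.optim.Adam(model.parameters(), lr=0.05)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(X), y_noisy).backward()
        opt.step()
    return model

plain = train(nn.CrossEntropyLoss())
smoothed = train(nn.CrossEntropyLoss(label_smoothing=0.1))

# Evaluate both against the clean labels.
acc = lambda m: (m(X).argmax(1) == y).float().mean().item()
print(acc(plain), acc(smoothed))
```

`label_smoothing=0.1` is exactly the "90% certain" target described above: the model is never rewarded for reaching 100% confidence on any single label.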
What you should see
The label-smoothed model loses a point or two on the clean baseline (no noise) but gains several points when noise is present.
This is the canonical bias-variance trade-off you make when your labels are not trustworthy.
Cognitive analog: this is the formal version of “I trust the textbook but not 100%”. Bayesians have been doing this for decades. Deep learning caught up around 2016.
Block 4 takeaways
- Few-shot calibration is the most practical version of the few-labels problem in BCI.
- Pretraining transforms calibration from “20 minutes” to “1 minute” without changing accuracy much.
- When labels are noisy, never let your model be 100% confident. Label smoothing is the cheapest form of regularization that buys you robustness.
Block 5
Active Learning and Meta-Learning
15:15 - 15:50
Sections 19.4 and 19.5
Active learning (19.4)
You have an unlabeled pool. Labeling is expensive (in BCI: a calibration trial). Which examples should you label first?
Murphy 19.4: pick the ones the model is most uncertain about.
We watch the curve climb faster with uncertainty sampling than with random sampling. Hands-on but light.
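The query loop fits in a dozen lines. A sketch on a synthetic pool, with the true labels playing the oracle (in a real BCI, the "oracle" is another calibration trial):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Unlabeled pool (synthetic stand-in), plus a small stratified seed set.
X = np.vstack([rng.normal(-0.5, 1, (150, 8)), rng.normal(0.5, 1, (150, 8))])
y = np.repeat([0, 1], 150)
labeled = list(np.concatenate(
    [rng.choice(np.where(y == c)[0], 3, replace=False) for c in (0, 1)]))

for _ in range(10):                           # ten labeling rounds
    clf = LogisticRegression().fit(X[labeled], y[labeled])
    pool = np.setdiff1d(np.arange(300), labeled)
    proba = clf.predict_proba(X[pool])
    # Uncertainty sampling: query the point closest to the decision boundary.
    query = pool[np.abs(proba[:, 1] - 0.5).argmin()]
    labeled.append(query)                     # "ask the oracle" for its label
```

Swap the `argmin` line for `rng.choice(pool)` and you have the random-sampling baseline; the gap between the two learning curves is the whole argument of 19.4.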
The cog-sci frame
This is exactly what an adaptive psychophysics protocol does.
QUEST and staircase methods are active learning by another name. The formalism just makes the connection explicit.
A skilled clinical interviewer also does this without thinking about it. The questions that buy you the most information are the ones whose answers you cannot predict in advance.
Meta-learning, in one slide (19.5)
What if the algorithm itself could be learned from a distribution of tasks?
That is meta-learning. The most-cited example is MAML (Finn et al. 2017), which learns an initialization that adapts to new tasks in a few gradient steps.
We will run a tiny MAML demo on episodes built from PhysionetMI subjects (each subject is a “task”). It will not beat fine-tuning by much, but the mechanism is worth seeing once.
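For the curious, here is the MAML mechanism in miniature: synthetic tasks, a linear model, one inner gradient step, second-order outer update. This is my toy stand-in, not the notebook's episode construction from PhysionetMI subjects:

```python
import torch

torch.manual_seed(0)

# Each "task" is a binary problem whose decision direction differs per task,
# standing in for per-subject episodes.
def sample_task(n=20):
    w_true = torch.randn(4)
    X = torch.randn(2 * n, 4)
    y = ((X @ w_true) > 0).float()
    return X[:n], y[:n], X[n:], y[n:]      # support / query split

w = torch.zeros(4, requires_grad=True)     # the meta-learned initialization
meta_opt = torch.optim.Adam([w], lr=0.05)
bce = torch.nn.functional.binary_cross_entropy_with_logits

for step in range(100):
    meta_opt.zero_grad()
    meta_loss = 0.0
    for _ in range(4):                     # a meta-batch of tasks
        Xs, ys, Xq, yq = sample_task()
        # Inner loop: one gradient step on the support set,
        # keeping the graph so the outer update can differentiate through it.
        inner = bce(Xs @ w, ys)
        (g,) = torch.autograd.grad(inner, w, create_graph=True)
        w_adapted = w - 0.5 * g
        # Outer loss: how well the *adapted* weights do on the query set.
        meta_loss = meta_loss + bce(Xq @ w_adapted, yq)
    meta_loss.backward()
    meta_opt.step()
```

The `create_graph=True` is the whole trick: the meta-gradient flows through the inner adaptation step, so the initialization is optimized for how well it adapts, not for how well it performs as-is.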
Block 5 takeaways
- Active learning is just adaptive experimental design with a loss function.
- Choosing what to label is often more useful than labeling more.
- Meta-learning is the formal version of “learning to learn”, which cog sci has been studying since Harlow’s monkeys.
Wrap
15:50 - 16:00
What we did, in one table
| Block | Strategy | Murphy section | Cog-sci analog |
|---|---|---|---|
| 1 | (problem statement) | - | The data-poor child |
| 2 | Augmentation | 19.1 | Invariances learned through play |
| 2 | Transfer learning | 19.2 | Adult learning a new language |
| 3 | Pretrained representations | 19.2 | Perception built on prior experience |
| 3 | Semi-supervised | 19.3 | Skimming after careful reading |
| 4 | Few-shot | 19.6 | One-shot category learning (Lake et al.) |
| 4 | Weakly supervised | 19.7 | Trusting noisy authorities |
| 5 | Active learning | 19.4 | Adaptive psychophysics |
| 5 | Meta-learning | 19.5 | Harlow’s learning sets |
What to read next
- Murphy, Probabilistic Machine Learning: An Introduction, Chapter 19.
- Lake, Salakhutdinov, Tenenbaum (2015). “Human-level concept learning through probabilistic program induction.” Science.
- Lake, Ullman, Tenenbaum, Gershman (2017). “Building machines that learn and think like people.” BBS.
- Schirrmeister et al. (2017). “Deep learning with convolutional neural networks for EEG decoding and visualization.” Human Brain Mapping.
- Yamins & DiCarlo (2016). “Using goal-driven deep learning models to understand sensory cortex.” Nature Neuroscience.
For the productivity-curious: Anne-Laure Le Cunff, Tiny Experiments. Different topic, same energy.
Thank you
Slides and notebooks: github.com/ccs-unilu/few-labels-2026
Questions, complaints, and existential crises: morteza.ansarinia@uni.lu