Block 4. Few-Shot Learning and Tolerating Noisy Labels

Goal. Two more strategies for the few-labels problem:

  • Few-shot calibration (19.6): how few target trials do we actually need to make a pretrained model usable on a new subject?
  • Weakly supervised learning (19.7): what to do when the labels you do have are noisy.

Time. About 45 minutes.


0. Setup

%%capture
!pip install -q moabb==1.1.0 mne==1.7.1 braindecode==0.8.1 skorch==1.0.0
import warnings, os, copy
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mne
import torch

from moabb.datasets import PhysionetMI
from moabb.paradigms import LeftRightImagery

from sklearn.preprocessing import LabelEncoder

from braindecode.models import EEGNetv4
from braindecode import EEGClassifier

mne.set_log_level("WARNING")
np.random.seed(42)
torch.manual_seed(42)

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Setup complete. Device: {device}")

1. Reload data and pretrained model

dataset = PhysionetMI()
paradigm = LeftRightImagery()
subjects = list(range(1, 11))
X, y, metadata = paradigm.get_data(dataset=dataset, subjects=subjects)

le = LabelEncoder()
y_int = le.fit_transform(y)
X = X.astype(np.float32)
n_chans, n_times = X.shape[1], X.shape[2]

target_subject = 1
source_mask = (metadata["subject"] != target_subject).values
target_mask = (metadata["subject"] == target_subject).values

X_target_all, y_target_all = X[target_mask], y_int[target_mask]
print(f"Target (S{target_subject}) has {len(y_target_all)} trials.")

Get the pretrained EEGNet from block 2.

PRETRAINED_PATH = "/content/eegnet_pretrained_block2.pt"

def make_eegnet(epochs=20, lr=0.001, batch=32, label_smoothing=0.0):
    return EEGClassifier(
        EEGNetv4,
        module__n_chans=n_chans,
        module__n_outputs=2,
        module__n_times=n_times,
        criterion=torch.nn.CrossEntropyLoss,
        criterion__label_smoothing=label_smoothing,
        optimizer=torch.optim.AdamW,
        optimizer__lr=lr,
        optimizer__weight_decay=0.01,
        train_split=None,
        batch_size=batch,
        max_epochs=epochs,
        device=device,
        verbose=0,
    )

clf_pretrained = make_eegnet(epochs=30)
clf_pretrained.initialize()
if os.path.exists(PRETRAINED_PATH):
    print("Loading pretrained EEGNet from disk")
    clf_pretrained.load_params(f_params=PRETRAINED_PATH)
else:
    print("No checkpoint found, retraining (2-3 minutes)")
    clf_pretrained.fit(X[source_mask], y_int[source_mask])
    clf_pretrained.save_params(f_params=PRETRAINED_PATH)
print("Pretrained model ready.")

2. Few-shot calibration (Murphy 19.6)

The problem

The chronic complaint of brain-computer interfaces: every new user needs a 20-minute calibration session before the system works for them.

The few-shot question (Murphy 19.6): what is the smallest number of labeled trials per class that still gives acceptable accuracy, given a model that was pretrained on other subjects?

The experiment

For each N in {1, 2, 5, 10, 20, 50}:

  1. Sample N trials per class from subject 1 as the calibration set.
  2. Fine-tune the pretrained model on those N trials.
  3. Evaluate on the rest of subject 1.
  4. Repeat with multiple random seeds for error bars.

def sample_target_trials(X_t, y_t, n_per_class, seed=0):
    """Draw n_per_class trials per class as a calibration set; return the rest as test set."""
    rng = np.random.RandomState(seed)
    chosen = []
    for c in np.unique(y_t):
        idx = np.where(y_t == c)[0]
        chosen.extend(rng.choice(idx, size=n_per_class, replace=False))
    chosen = np.array(chosen)
    rest = np.setdiff1d(np.arange(len(y_t)), chosen)
    return X_t[chosen], y_t[chosen], X_t[rest], y_t[rest]


n_per_class_list = [1, 2, 5, 10, 20, 50]
results_fewshot = []

for n in n_per_class_list:
    # Skip N if either class is too small to sample without replacement
    # while leaving at least one trial per class for testing.
    if n >= np.bincount(y_target_all).min():
        continue
    accs = []
    for seed in range(5):
        X_t_train, y_t_train, X_t_test, y_t_test = sample_target_trials(
            X_target_all, y_target_all, n_per_class=n, seed=seed
        )
        # Copy the pretrained net, lower the learning rate, and fine-tune;
        # partial_fit continues training from the pretrained weights.
        clf_ft = copy.deepcopy(clf_pretrained)
        clf_ft.set_params(max_epochs=15, optimizer__lr=0.0003)
        clf_ft.partial_fit(X_t_train, y_t_train)
        accs.append(clf_ft.score(X_t_test, y_t_test))
    results_fewshot.append({
        "n_per_class": n,
        "mean_acc": np.mean(accs),
        "std_acc": np.std(accs),
    })

results_fewshot = pd.DataFrame(results_fewshot)
print(results_fewshot.to_string(index=False))

The picture

fig, ax = plt.subplots(figsize=(8, 4.5))
ax.errorbar(results_fewshot["n_per_class"], results_fewshot["mean_acc"],
            yerr=results_fewshot["std_acc"], marker="o", capsize=5)
ax.axhline(0.5, ls="--", color="k", alpha=0.5, label="chance")
ax.set_xscale("log")
ax.set_xlabel("Calibration trials per class (log scale)")
ax.set_ylabel(f"Accuracy on rest of S{target_subject}")
ax.set_title("19.6: the few-shot calibration curve")
ax.set_ylim(0.4, 1.0)
ax.legend()
plt.tight_layout()
plt.show()

What this tells you

A monotonic curve with a steep slope at the bottom and saturation at the top. The shape tells you, for your application, how much calibration is “enough”:

  • For a research demo: 5 trials. About 30 seconds of recording.
  • For a wheelchair: 50 trials, and you start asking about safety margins.
  • For a clinical implant: probably none of the above is enough.

The point of the curve is not the absolute numbers. It is that the shape exists, that pretraining bends it favorably, and that the trade-off between calibration cost and accuracy is now an explicit parameter of the system.
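
Once the curve exists, that parameter can be read off directly. A minimal sketch, using the results_fewshot table computed above; the 70% threshold is an arbitrary example, not a recommendation:

target_acc = 0.70  # example threshold; choose one for your application
reaching = results_fewshot[results_fewshot["mean_acc"] >= target_acc]
if len(reaching) > 0:
    n_needed = int(reaching["n_per_class"].iloc[0])
    print(f"Smallest tested N reaching {target_acc:.0%}: {n_needed} trials/class")
else:
    print(f"No tested N reaches {target_acc:.0%} mean accuracy")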

Cog-sci framing

This is the formal, parameter-explicit version of “how many examples does a learner need”. Lake et al. (2015) made the same argument with humans on Omniglot characters: people learn new categories from one or a few examples because they bring strong priors. EEGNet, after pretraining on 9 brains, has weaker priors than humans, but the qualitative shape of the curve is the same.


3. Weakly supervised learning: tolerating noisy labels (Murphy 19.7)

The setup

Murphy 19.7 covers the situation where your labels are unreliable. In cognitive science this happens constantly:

  • Motor imagery: we cannot verify that the subject actually imagined what they were instructed to. Some trials are mislabeled at the source.
  • Clinical EEG: gold-standard interrater agreement is often only 80%. 20% of “ground truth” is debatable.
  • Behavioral data: participants press the wrong button. Some “correct” trials are accidents and some “incorrect” ones are confusions.

What do you do when the labels themselves are noisy?

The simplest fix: label smoothing

Standard cross-entropy loss tells the model “this trial is class 1, with 100% certainty”. Label smoothing softens that target: with smoothing ε = 0.1 and K = 2 classes, PyTorch gives the true class probability 1 − ε + ε/K = 0.95 and the other class ε/K = 0.05.

Because the smoothed target is never exactly one-hot, perfect confidence is never optimal, and the model never gets unbounded gradient pressure to fit any single (possibly wrong) label. That is what buys robustness.
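
A quick numerical check makes this concrete. This standalone sketch calls torch.nn.functional directly (the same cross-entropy the classifier uses); the logit values are arbitrary illustrations:

import torch.nn.functional as F

# One confident prediction for class 0, scored under both losses.
logits = torch.tensor([[2.0, -2.0]])   # model leans strongly toward class 0
label = torch.tensor([0])

# Hard target [1.0, 0.0] vs. smoothed target [0.95, 0.05] (eps = 0.1, K = 2).
print(f"hard-label CE:     {F.cross_entropy(logits, label).item():.3f}")
print(f"smoothed CE (0.1): {F.cross_entropy(logits, label, label_smoothing=0.1).item():.3f}")
# The smoothed loss stays bounded away from zero: driving confidence toward
# 100% no longer reduces the loss, so no single label can demand unbounded
# gradients.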

The experiment

We deliberately corrupt some fraction of subject 1’s training labels. We train two models:

  • Standard cross-entropy (label smoothing = 0).
  • Label-smoothed cross-entropy (label smoothing = 0.1).

We evaluate both on a clean test set, for a range of noise rates.

def corrupt_labels(y, noise_rate, seed=0):
    """Flip a random noise_rate fraction of binary labels to the other class."""
    rng = np.random.RandomState(seed)
    y_noisy = y.copy()
    n_flip = int(noise_rate * len(y))
    flip_idx = rng.choice(len(y), size=n_flip, replace=False)
    classes = np.unique(y)
    for i in flip_idx:
        y_noisy[i] = classes[1] if y[i] == classes[0] else classes[0]
    return y_noisy
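
# Optional sanity check: the realized flip fraction should match noise_rate.
# Uses y_target_all from above purely for illustration.
y_demo = corrupt_labels(y_target_all, noise_rate=0.2, seed=0)
print(f"Requested 20% noise, realized: {(y_demo != y_target_all).mean():.2f}")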


# Use a reasonably sized training pool from subject 1, capped so that
# sampling without replacement never asks for more trials than a class has
# (PhysionetMI provides only a few dozen trials per class per subject).
n_pool = min(30, np.bincount(y_target_all).min() - 5)  # keep >= 5 test trials per class
X_t_train_full, y_t_train_full, X_t_test, y_t_test = sample_target_trials(
    X_target_all, y_target_all, n_per_class=n_pool, seed=0
)

noise_rates = [0.0, 0.1, 0.2, 0.3]
results_noise = []

for nr in noise_rates:
    for label_smooth in [0.0, 0.1]:
        accs = []
        for seed in range(3):
            y_noisy = corrupt_labels(y_t_train_full, nr, seed=seed)
            clf = copy.deepcopy(clf_pretrained)
            clf.set_params(
                max_epochs=15,
                optimizer__lr=0.0003,
                criterion__label_smoothing=label_smooth,
            )
            clf.partial_fit(X_t_train_full, y_noisy)
            # Evaluate on the CLEAN test set
            accs.append(clf.score(X_t_test, y_t_test))
        results_noise.append({
            "noise_rate": nr,
            "label_smoothing": label_smooth,
            "mean_acc": np.mean(accs),
            "std_acc": np.std(accs),
        })

results_noise = pd.DataFrame(results_noise)
print(results_noise.to_string(index=False))

The picture

fig, ax = plt.subplots(figsize=(8, 5))
for ls_val, marker, color in [(0.0, "o", "C0"), (0.1, "s", "C1")]:
    sub = results_noise[results_noise["label_smoothing"] == ls_val]
    label = "standard cross-entropy" if ls_val == 0.0 else "+ label smoothing (0.1)"
    ax.errorbar(sub["noise_rate"], sub["mean_acc"], yerr=sub["std_acc"],
                marker=marker, color=color, capsize=4, label=label)
ax.axhline(0.5, ls="--", color="k", alpha=0.5, label="chance")
ax.set_xlabel("Fraction of training labels corrupted")
ax.set_ylabel(f"Accuracy on CLEAN held-out trials of S{target_subject}")
ax.set_title("19.7: label smoothing buys robustness to noisy labels")
ax.set_ylim(0.4, 0.9)
ax.legend()
plt.tight_layout()
plt.show()

What you should see

At 0% noise, the two models are roughly tied. Label smoothing might cost a hair on clean data because it prevents the model from being confidently right.

As the noise rate climbs, the standard model’s accuracy drops sharply. The label-smoothed model drops more gradually. By 30% noise, the gap is several points and clearly favors smoothing.

This is the canonical bias-variance trade-off you make when you do not fully trust your labels. It also shows why moderate noise is survivable at all: under symmetric flipping at rate ρ < 0.5, the noisy class posterior is p̃(y = 1 | x) = (1 − ρ)·p(y = 1 | x) + ρ·p(y = 0 | x), which shrinks probabilities toward 0.5 but never changes which class is more likely. Bayesians have been exploiting this for decades by putting priors on the parameters of the noise process. Deep learning caught up around 2016 with the original label smoothing paper (Szegedy et al.).

Cog-sci framing

Label smoothing is the formal, gradient-friendly version of “trust your sources but not absolutely”. A clinician who treats every interrater disagreement as a coin flip rather than a verdict is doing something analogous. So is a Bayesian observer with non-degenerate priors.

There are richer methods for noisy labels (co-teaching, MentorNet, noise-robust losses, mixup of one-hot labels). For a non-technical audience, label smoothing is the right level: one hyperparameter, one line of code, real benefit.
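
For a taste of that richer family, here is a minimal sketch of the last item, mixup on one-hot labels. The helper is hypothetical and standalone; wiring it into the EEGClassifier pipeline would additionally require a loss that accepts soft targets:

def mixup_batch(X, y_onehot, alpha=0.2, seed=0):
    """Blend random pairs of (trial, one-hot label), producing soft labels."""
    rng = np.random.RandomState(seed)
    lam = rng.beta(alpha, alpha)        # mixing weight, drawn once per batch
    perm = rng.permutation(len(X))      # pair each trial with a random partner
    X_mix = lam * X + (1 - lam) * X[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return X_mix, y_mix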


4. Combined picture

fig, axes = plt.subplots(1, 2, figsize=(14, 4.5))

ax = axes[0]
ax.errorbar(results_fewshot["n_per_class"], results_fewshot["mean_acc"],
            yerr=results_fewshot["std_acc"], marker="o", capsize=5)
ax.axhline(0.5, ls="--", color="k", alpha=0.5)
ax.set_xscale("log")
ax.set_xlabel("Calibration trials per class")
ax.set_ylabel("Accuracy")
ax.set_title("19.6: few-shot calibration")
ax.set_ylim(0.4, 1.0)

ax = axes[1]
for ls_val, marker, color in [(0.0, "o", "C0"), (0.1, "s", "C1")]:
    sub = results_noise[results_noise["label_smoothing"] == ls_val]
    label = "standard CE" if ls_val == 0.0 else "+ label smoothing"
    ax.errorbar(sub["noise_rate"], sub["mean_acc"], yerr=sub["std_acc"],
                marker=marker, color=color, capsize=4, label=label)
ax.axhline(0.5, ls="--", color="k", alpha=0.5)
ax.set_xlabel("Noise rate in training labels")
ax.set_ylabel("Accuracy on clean test")
ax.set_title("19.7: noisy labels")
ax.set_ylim(0.4, 0.9)
ax.legend()

plt.tight_layout()
plt.show()

Two faces of the same problem (limited high-quality supervision), with two different solutions, both useful, both cheap.


5. Discussion

  1. The few-shot calibration curve saturates well below 100% accuracy. Why do additional calibration trials stop helping at the high end? Is the bottleneck the model, the data, or the subject?

  2. We simulated label noise by random flipping. Real label noise in cognitive science data is often systematic: certain trial types are mislabeled more often than others (e.g., short reaction times confused with anticipations). Would label smoothing still help in that case? Why or why not?

  3. At what noise rate would you stop trusting any single trial and instead aggregate trials into bags (multi-instance learning)? This is a different family of weakly-supervised methods, also in Murphy 19.7.
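
For question 3, a minimal sketch of the bag idea, assuming a fitted classifier clf (for instance the last one from section 3) and the X_t_test / y_t_test split from above. Bags are built from same-label trials here purely for illustration; real multi-instance learning handles mixed bags, which this sketch does not:

k = 5                                        # trials per bag (arbitrary example)
order = np.argsort(y_t_test, kind="stable")  # group trials by true label
proba = clf.predict_proba(X_t_test[order])
labels = y_t_test[order]

n_bags = len(labels) // k
bag_proba = proba[: n_bags * k].reshape(n_bags, k, -1).mean(axis=1)
bag_label = labels[: n_bags * k].reshape(n_bags, k)[:, 0]  # label-pure, except possibly one boundary bag

print(f"trial-level accuracy: {(proba.argmax(axis=1) == labels).mean():.2f}")
print(f"bag-level accuracy:   {(bag_proba.argmax(axis=1) == bag_label).mean():.2f}")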


6. Block 4 takeaways

  • Few-shot calibration is the most practical version of the few-labels problem in BCI.
  • Pretraining transforms calibration from “20 minutes” to “1 minute” without changing accuracy much.
  • When labels are noisy, never let your model be 100% confident. Label smoothing is the cheapest form of regularization that buys you robustness.
  • Sections 19.6 and 19.7 are now done. After block 5 we will have completed all of Chapter 19.