Block 1. Why Few Labels Are Hard
Goal. Train a small classifier and watch it fail in two ways:
- when we feed it very few labels,
- when we ask it to generalize to a new person.
Why. Both are versions of the same problem: not enough labels for what you are trying to do. The whole rest of today (Murphy Ch 19) is six different fixes for it.
Time. About 45 minutes, including a 5-minute first-time download.
0. Setup
Run the install cell once. The first time it will take about a minute. After that the runtime caches everything.
%%capture
!pip install -q moabb==1.1.0 mne==1.7.1
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mne
from moabb.datasets import PhysionetMI
from moabb.paradigms import LeftRightImagery
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from scipy.signal import welch
mne.set_log_level("WARNING")
np.random.seed(42)
print("Setup complete.")
1. Load 10 subjects of motor imagery EEG
We use the PhysioNet Motor Imagery dataset through MOABB. MOABB is a thin wrapper that handles all the file loading, epoching, and labeling that you would otherwise spend the morning fighting.
Each trial is a few seconds of EEG recorded while the subject was imagining either a left hand or a right hand movement.
dataset = PhysionetMI()
paradigm = LeftRightImagery()
# This will download about 500 MB the first time. Be patient.
subjects = list(range(1, 11)) # subjects 1 through 10
X, y, metadata = paradigm.get_data(dataset=dataset, subjects=subjects)
print(f"X shape: {X.shape}") # (n_trials, n_channels, n_times)
print(f"y shape: {y.shape}")
print(f"Unique labels: {np.unique(y)}")
print(f"Subjects available: {sorted(metadata['subject'].unique())}")
The data is now a single big array. metadata tells us which trial belongs to which subject and session, which we will use throughout.
# Sanity check: per-subject trial counts
metadata.groupby("subject").size().describe()
2. What does motor imagery look like?
Before we throw a classifier at the data, look at it.
sfreq = 160 # sampling rate for PhysionetMI
times = np.arange(X.shape[2]) / sfreq
# Pick one trial of each class from subject 1
mask_subj1 = (metadata["subject"] == 1).values
X_subj1 = X[mask_subj1]
y_subj1 = y[mask_subj1]
idx_left = np.where(y_subj1 == "left_hand")[0][0]
idx_right = np.where(y_subj1 == "right_hand")[0][0]
# C3 and C4 sit over left and right motor cortex respectively.
# The indices below are hard-coded for this dataset's channel ordering;
# sanity-check them against your channel names before trusting the plot.
c3_idx, c4_idx = 4, 5
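# If you would rather not trust hard-coded indices, one option is to ask
# MOABB for MNE Epochs and look the channels up by name (an untested
# sketch; it reuses the cached download):
#   epochs, _, _ = paradigm.get_data(dataset, subjects=[1], return_epochs=True)
#   c3_idx = epochs.ch_names.index("C3")
#   c4_idx = epochs.ch_names.index("C4")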
fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
for ax, idx, label in zip(axes, [idx_left, idx_right], ["left_hand", "right_hand"]):
    ax.plot(times, X_subj1[idx, c3_idx], label="C3 (left motor)", alpha=0.8)
    ax.plot(times, X_subj1[idx, c4_idx], label="C4 (right motor)", alpha=0.8)
    ax.set_title(f"Subject 1, imagined {label}")
    ax.set_xlabel("Time (s)")
    ax.legend()
axes[0].set_ylabel("EEG amplitude (uV)")
plt.tight_layout()
plt.show()
You will not be able to tell the two classes apart by eye in the raw traces. That is the point: the brain is a noisy place, and the signal we want is buried in oscillations that vary from trial to trial.
The standard trick is to look at band power: how much energy is in specific frequency bands during the trial. The two bands that carry motor-imagery information are:
- alpha (8 to 13 Hz), specifically the mu rhythm over sensorimotor cortex
- beta (13 to 30 Hz), also strongly modulated by movement
When you imagine moving your left hand, alpha and beta over the right motor cortex (channel C4) typically decrease. And vice versa.
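Before building the full feature extractor, you can check this numerically on the two trials plotted above. Here is a quick sketch; single trials are noisy, so the contralateral drop may or may not show up on this particular pair.
def alpha_power(trial, ch, sfreq=160, lo=8, hi=13):
    # Mean Welch PSD in the alpha band for one channel of one trial.
    f, psd = welch(trial[ch], fs=sfreq, nperseg=min(256, trial.shape[-1]))
    band = (f >= lo) & (f <= hi)
    return psd[band].mean()

for idx, label in [(idx_left, "left_hand"), (idx_right, "right_hand")]:
    print(f"{label}: alpha power C3 = {alpha_power(X_subj1[idx], c3_idx):.3e}, "
          f"C4 = {alpha_power(X_subj1[idx], c4_idx):.3e}")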
3. Extract bandpower features
We turn each trial into a small vector of numbers: the log power in alpha and beta for each channel.
def bandpower_features(X, sfreq=160, bands=(("alpha", 8, 13), ("beta", 13, 30))):
    """
    X: (n_trials, n_channels, n_times)
    Returns: (n_trials, n_channels * n_bands) of log bandpower.
    """
    n_trials, n_channels, n_times = X.shape
    n_bands = len(bands)
    features = np.zeros((n_trials, n_channels * n_bands))
    for t in range(n_trials):
        for c in range(n_channels):
            # PSD via Welch's method; nperseg capped at the trial length.
            f, psd = welch(X[t, c], fs=sfreq, nperseg=min(256, n_times))
            for b, (_, lo, hi) in enumerate(bands):
                mask = (f >= lo) & (f <= hi)
                features[t, c * n_bands + b] = np.log(psd[mask].mean() + 1e-12)
    return features
# Compute for all trials. Takes about 30 seconds.
features = bandpower_features(X, sfreq=sfreq)
print(f"Feature matrix shape: {features.shape}")
Each row is now a single trial described by one number per channel-band pair. Easy food for any classifier.
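If you ever need to trace a feature column back to its channel and band, the layout in bandpower_features is column = channel * n_bands + band. A tiny helper to invert it (feature_name is our own illustrative function, not part of any library):
band_names = ["alpha", "beta"]

def feature_name(col, n_bands=len(band_names)):
    # Inverts col = channel * n_bands + band from bandpower_features.
    return col // n_bands, band_names[col % n_bands]

ch, band = feature_name(7)
print(f"Column 7 = channel {ch}, {band} band")  # -> channel 3, beta band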
4. The “everything works” baseline
We start with the easy case: train and test on subject 1 with all of their available labels.
features_subj1 = features[mask_subj1]
y_subj1 = y[mask_subj1]
print(f"Subject 1 has {len(y_subj1)} trials.")
print(f"Class balance: {dict(zip(*np.unique(y_subj1, return_counts=True)))}")
Build a simple pipeline (standardize then logistic regression) and cross-validate within subject 1.
pipe = make_pipeline(
StandardScaler(),
LogisticRegression(max_iter=1000, C=1.0)
)
baseline_scores = cross_val_score(pipe, features_subj1, y_subj1, cv=5)
print(f"Within-subject 5-fold accuracy: {baseline_scores.mean():.3f} "
      f"(std {baseline_scores.std():.3f})")
You should see something in the range of 0.70 to 0.85. For motor imagery on a single subject, with no spatial filtering and no fancy tricks, this is a perfectly respectable number. Hold this in your head as the upper bound for the rest of the notebook.
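The setup cell imported train_test_split, confusion_matrix, and ConfusionMatrixDisplay, which we have not used yet. If you are curious where the errors land, here is an optional check (the 25% hold-out is an arbitrary choice, not part of the protocol):
X_tr, X_te, y_tr, y_te = train_test_split(
    features_subj1, y_subj1, test_size=0.25, stratify=y_subj1, random_state=42)
pipe.fit(X_tr, y_tr)
cm = confusion_matrix(y_te, pipe.predict(X_te), labels=np.unique(y_subj1))
ConfusionMatrixDisplay(cm, display_labels=np.unique(y_subj1)).plot()
plt.show()
With two balanced classes, the interesting thing to look for is asymmetry between the two error types.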
5. Failure mode 1: not enough labels
What if we did not have all those trials? What if all we had were 5, or 10, or 20 labels per class?
This is the few-labels problem in its purest form. We take subject 1 (no transfer, no shift), randomly subsample to N trials per class, and fit the same pipeline. We repeat each subsample size 20 times with different random seeds, so we get error bars.
def few_label_score(features_pool, y_pool, n_per_class, n_seeds=20, cv=3):
    """
    Subsample n_per_class trials per class, fit the pipeline, cross-validate.
    Returns mean and std accuracy across n_seeds runs.
    """
    accs = []
    classes = np.unique(y_pool)
    for seed in range(n_seeds):
        rng = np.random.RandomState(seed)
        chosen = []
        for cls in classes:
            idx = np.where(y_pool == cls)[0]
            if n_per_class > len(idx):
                return np.nan, np.nan  # not enough trials for this budget
            chosen.extend(rng.choice(idx, size=n_per_class, replace=False))
        chosen = np.array(chosen)
        pipe = make_pipeline(StandardScaler(),
                             LogisticRegression(max_iter=1000, C=1.0))
        scores = cross_val_score(pipe, features_pool[chosen], y_pool[chosen],
                                 cv=cv)
        accs.append(scores.mean())
    return np.mean(accs), np.std(accs)
# Sweep across label budgets
n_per_class_list = [5, 10, 20, 40, 60]
results = []
for n in n_per_class_list:
    mean_acc, std_acc = few_label_score(features_subj1, y_subj1, n)
    results.append((n, mean_acc, std_acc))
    print(f"n={n:3d} per class: {mean_acc:.3f} +/- {std_acc:.3f}")
Plot the learning curve.
ns = [r[0] for r in results]
means = [r[1] for r in results]
stds = [r[2] for r in results]
fig, ax = plt.subplots(figsize=(7, 4))
ax.errorbar(ns, means, yerr=stds, marker="o", capsize=4)
ax.axhline(0.5, ls="--", color="k", alpha=0.5, label="chance")
ax.axhline(baseline_scores.mean(), ls=":", color="C2", alpha=0.7,
label=f"all labels ({baseline_scores.mean():.2f})")
ax.set_xlabel("Trials per class in training set")
ax.set_ylabel("Accuracy (3-fold CV, mean over 20 seeds)")
ax.set_title("Failure mode 1: not enough labels")
ax.set_ylim(0.4, 1.0)
ax.legend()
plt.tight_layout()
plt.show()
You should see a clear upward trend: more labels means better accuracy. With 5 trials per class the model is barely above chance; with 60 it approaches the all-labels baseline.
This is the central problem of Chapter 19. Every method we will see today is a different way to make the curve climb faster.
6. Failure mode 2: not the right person
Now the second flavor of “not enough labels”. Same model, all of subject 1’s labels available, but we ask it to predict on subject 2 with zero calibration.
Your turn. Fill in the blanks:
# Get features and labels for subject 2
mask_subj2 = (metadata["subject"] == 2).values
features_subj2 = features[mask_subj2]
y_subj2 = y[mask_subj2]
# YOUR CODE HERE:
# 1. Fit `pipe` on subject 1's features and labels.
# 2. Score it on subject 2's features and labels.
# Hint: pipe.fit(X, y), then pipe.score(X_test, y_test).
cross_acc = ... # your accuracy here
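# One possible solution, for checking your work after you have tried
# (it mirrors the hint above; any equivalent two lines are fine):
#   pipe.fit(features_subj1, y_subj1)
#   cross_acc = pipe.score(features_subj2, y_subj2)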
print(f"Cross-subject accuracy (1 -> 2): {cross_acc:.3f}")
Once that runs, sweep across all 10 subjects.
results_cross = []
for source in subjects:
    src_mask = (metadata["subject"] == source).values
    pipe = make_pipeline(StandardScaler(),
                         LogisticRegression(max_iter=1000, C=1.0))
    pipe.fit(features[src_mask], y[src_mask])
    for target in subjects:
        if target == source:
            continue
        tgt_mask = (metadata["subject"] == target).values
        acc = pipe.score(features[tgt_mask], y[tgt_mask])
        results_cross.append({"source": source, "target": target, "acc": acc})
results_df = pd.DataFrame(results_cross)
print(f"\nMean cross-subject accuracy: {results_df['acc'].mean():.3f}")
print(f"Best pair: {results_df['acc'].max():.3f}")
print(f"Worst pair: {results_df['acc'].min():.3f}")
7. The picture: two failures, one problem
within_means = []
for s in subjects:
    mask = (metadata["subject"] == s).values
    pipe = make_pipeline(StandardScaler(),
                         LogisticRegression(max_iter=1000, C=1.0))
    scores = cross_val_score(pipe, features[mask], y[mask], cv=5)
    within_means.append(scores.mean())
cross_means = results_df.groupby("source")["acc"].mean().values
fig, ax = plt.subplots(figsize=(9, 5))
x = np.arange(len(subjects))
width = 0.35
ax.bar(x - width/2, within_means, width, label="within-subject (all labels)")
ax.bar(x + width/2, cross_means, width, label="cross-subject (zero calibration)")
ax.axhline(0.5, color="k", linestyle="--", alpha=0.5, label="chance")
ax.set_xticks(x)
ax.set_xticklabels([f"S{s}" for s in subjects])
ax.set_ylabel("Accuracy")
ax.set_title("Two failure modes of the few-labels problem")
ax.set_ylim(0.4, 1.0)
ax.legend()
plt.tight_layout()
plt.show()
If you got here, the picture should look something like this:
- Within-subject bars sit between 0.65 and 0.85.
- Cross-subject bars hover near 0.55.
The first gap (chance to within) is what good features and good models buy you. The second gap (within to cross) is what the rest of today is trying to close.
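To put numbers on the two gaps, using the quantities computed above (0.5 is chance for two balanced classes):
within = np.mean(within_means)
cross = results_df["acc"].mean()
print(f"Gap 1, chance -> within: {within - 0.5:.3f}")
print(f"Gap 2, within -> cross:  {within - cross:.3f}")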
8. Discussion
While the room compares numbers, think about these:
- The cross-subject classifier is doing better than chance, on average. What is it picking up that does generalize across people?
- The few-labels curve in section 5 looks similar to the cross-subject gap in section 7. In what sense are they the same problem?
- Suppose you had ten minutes of new data from a new user. Would you rather use it to (a) fine-tune the pretrained model, (b) retrain from scratch, or (c) pick the most similar existing subject and use their model? We will come back to this in block 4.
9. What just happened, in one sentence
We built a model that reached roughly 80% with all of subject 1's labels, roughly 60% with only ten of subject 1's labels per class, and roughly 55% with all of subject 1's labels but tested on subject 2.
All three numbers are evidence of the same disease: supervised models need many labels of exactly the right kind to work well, and we rarely have them.
Chapter 19 is six different cures. After lunch, we start with two of them.