Block 5. Active Learning and Meta-Learning


Goal. The last two strategies from Chapter 19:

  • Active learning (19.4): choose what to label next.
  • Meta-learning (19.5): learn an algorithm that adapts quickly to new tasks.

Time. About 35 minutes. Lighter touch than blocks 1 to 4. The MAML demo is the most fragile cell of the day. If it does not converge in your session, the slides version explains the idea instead.


0. Setup

%%capture
!pip install -q moabb==1.1.0 mne==1.7.1 braindecode==0.8.1 skorch==1.0.0 \
                learn2learn==0.2.0
import warnings, copy
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mne
import torch
import torch.nn.functional as F

from moabb.datasets import PhysionetMI
from moabb.paradigms import LeftRightImagery

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

from scipy.signal import welch

from braindecode.models import EEGNetv4

mne.set_log_level("WARNING")
np.random.seed(42)
torch.manual_seed(42)

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Setup complete. Device: {device}")

1. Reload data and make features

dataset = PhysionetMI()
paradigm = LeftRightImagery()
subjects = list(range(1, 11))
X, y, metadata = paradigm.get_data(dataset=dataset, subjects=subjects)

le = LabelEncoder()
y_int = le.fit_transform(y)
X = X.astype(np.float32)
n_chans, n_times = X.shape[1], X.shape[2]


def bandpower_features(Xin, sfreq=160):
    bands = (("alpha", 8, 13), ("beta", 13, 30))
    feats = np.zeros((Xin.shape[0], Xin.shape[1] * len(bands)),
                     dtype=np.float32)
    for t in range(Xin.shape[0]):
        for c in range(Xin.shape[1]):
            f, psd = welch(Xin[t, c], fs=sfreq,
                           nperseg=min(256, Xin.shape[2]))
            for b, (_, lo, hi) in enumerate(bands):
                m = (f >= lo) & (f <= hi)
                feats[t, c * len(bands) + b] = np.log(psd[m].mean() + 1e-12)
    return feats

features = bandpower_features(X)
print(f"Features shape: {features.shape}")

2. Active learning (Murphy 19.4)

The problem

You have an unlabeled pool. Each label costs something (in BCI: a calibration trial, which costs the user about 5 seconds plus mental effort). What is the cheapest order in which to ask for labels?

Two strategies

  • Random: shuffle the pool, label the first N. Treats every example as equally informative.
  • Uncertainty sampling: train on a tiny seed, then iteratively pick the unlabeled example the model is least certain about, label it, add it, retrain.

The intuition for uncertainty sampling: examples the model is already confident about are not going to teach it anything new. Examples on the decision boundary are.

def random_acquire(X_pool, y_pool, n_total, seed=0):
    """Random baseline: shuffle and take the first n_total."""
    rng = np.random.RandomState(seed)
    return list(rng.permutation(len(X_pool))[:n_total])


def uncertainty_acquire(X_pool, y_pool, n_initial, n_total, seed=0):
    """Iterative: start random, then add most-uncertain examples."""
    rng = np.random.RandomState(seed)
    selected = list(rng.permutation(len(X_pool))[:n_initial])
    remaining = set(range(len(X_pool))) - set(selected)

    while len(selected) < n_total:
        pipe = make_pipeline(StandardScaler(),
                             LogisticRegression(max_iter=1000, C=1.0))
        pipe.fit(X_pool[selected], y_pool[selected])
        rem_idx = np.array(sorted(remaining))
        proba = pipe.predict_proba(X_pool[rem_idx])
        # 1 - 2*|p - 0.5| lies in [0, 1]: 0 for confident predictions,
        # 1 for predictions sitting on the decision boundary (p = 0.5).
        uncertainty = 1.0 - np.abs(proba[:, 0] - 0.5) * 2
        next_idx = rem_idx[np.argmax(uncertainty)]
        selected.append(int(next_idx))
        remaining.remove(int(next_idx))
    return selected

The experiment

We run on subject 1. The “pool” is all of subject 1’s trials except the held-out test set. We sweep budget from 5 to 30 labels.

target_subject = 1
mask_target = (metadata["subject"] == target_subject).values

f_target = features[mask_target]
y_target = y_int[mask_target]

# Held-out test set
rng = np.random.RandomState(0)
classes = np.unique(y_target)
test_idx = []
for c in classes:
    idx = np.where(y_target == c)[0]
    test_idx.extend(rng.choice(idx, size=len(idx)//3, replace=False))
test_idx = np.array(test_idx)
pool_idx = np.setdiff1d(np.arange(len(y_target)), test_idx)

f_pool, y_pool = f_target[pool_idx], y_target[pool_idx]
f_test, y_test = f_target[test_idx], y_target[test_idx]

print(f"Pool size: {len(y_pool)}, Test size: {len(y_test)}")
budgets = [5, 10, 15, 20, 25, 30]
results_active = []

for budget in budgets:
    for strategy in ["random", "uncertainty"]:
        accs = []
        for seed in range(10):
            if strategy == "random":
                chosen = random_acquire(f_pool, y_pool, budget, seed=seed)
            else:
                chosen = uncertainty_acquire(f_pool, y_pool,
                                             n_initial=4, n_total=budget,
                                             seed=seed)
            pipe = make_pipeline(StandardScaler(),
                                 LogisticRegression(max_iter=1000, C=1.0))
            pipe.fit(f_pool[chosen], y_pool[chosen])
            accs.append(pipe.score(f_test, y_test))
        results_active.append({
            "budget": budget,
            "strategy": strategy,
            "mean_acc": np.mean(accs),
            "std_acc": np.std(accs),
        })

results_active = pd.DataFrame(results_active)
print(results_active.to_string(index=False))

The picture

fig, ax = plt.subplots(figsize=(8, 4.5))
for strat, marker, color in [("random", "o", "C0"),
                              ("uncertainty", "s", "C1")]:
    sub = results_active[results_active["strategy"] == strat]
    ax.errorbar(sub["budget"], sub["mean_acc"], yerr=sub["std_acc"],
                marker=marker, color=color, capsize=4, label=strat)
ax.axhline(0.5, ls="--", color="k", alpha=0.5, label="chance")
ax.set_xlabel("Label budget")
ax.set_ylabel(f"Accuracy on held-out S{target_subject}")
ax.set_title("19.4: choosing what to label is sometimes cheaper than labeling more")
ax.set_ylim(0.4, 0.85)
ax.legend()
plt.tight_layout()
plt.show()

What you should see

Both curves climb. The uncertainty curve should sit above the random curve at small budgets, because it requests the most informative trials first. The two often converge around a budget of 30.

If your run shows uncertainty worse than random, increase n_initial to 6 or 8. With too few seed examples, the early uncertainty estimates are unreliable, which sometimes hurts more than it helps. This sensitivity is itself a finding from the active learning literature.

Cog-sci framing

Active learning is not new to cognitive science. It just has different names there:

  • QUEST (Watson and Pelli, 1983): adaptive psychophysics that picks the next stimulus level to maximize information about threshold.
  • Staircase methods: a simpler member of the same family.
  • Computerized adaptive testing: picks test items in real time to maximize information about ability.

All are special cases of “choose the next observation to maximize expected information gain”. Murphy 19.4 is the formal version.
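As a concrete illustration (not a cell from this notebook): for a probabilistic classifier, "expected information" is commonly approximated by the predictive entropy of each candidate. On a binary problem this ranks examples exactly like the margin score used in `uncertainty_acquire` above, but it generalizes to any number of classes. A minimal sketch:

```python
import numpy as np

def entropy_acquisition(proba):
    """Predictive entropy per example; higher = more informative to label.
    For two classes this ranks examples the same way as 1 - 2*|p - 0.5|,
    but it also works for multiclass predict_proba output."""
    eps = 1e-12  # avoid log(0)
    return -np.sum(proba * np.log(proba + eps), axis=1)

proba = np.array([[0.50, 0.50],   # on the decision boundary
                  [0.90, 0.10],   # confident
                  [0.60, 0.40]])  # mildly uncertain
scores = entropy_acquisition(proba)
print(int(scores.argmax()))  # -> 0: the boundary example is queried first
```

To use it in the loop above, replace the margin-based `uncertainty` with `entropy_acquisition(proba)` and keep the `argmax` selection unchanged.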

A skilled clinical interviewer also does active learning without realizing it. The questions that buy the most information are the ones whose answers you cannot predict in advance.


3. Meta-learning: a tiny MAML demo (Murphy 19.5)

What MAML does

Standard pretraining (block 2) finds a single good initialization on many subjects, then fine-tunes on a target. MAML (Finn et al., 2017) asks for something stronger: an initialization such that a few gradient steps on the target’s tiny calibration set yield a good model.

The trick: meta-train on episodes. Each episode picks a task (here, a subject), splits its data into a small support set and a query set, takes a few gradient steps on the support, and asks the meta-loss to improve the resulting query loss.

The hierarchical Bayesian view: MAML is learning the prior over task parameters (Murphy Figure 19.14).
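Written out for a single inner gradient step with inner learning rate α (the `lr` passed to `MAML` below), the episode loop optimizes:

```latex
\theta^{*} \;=\; \arg\min_{\theta}\;
\mathbb{E}_{\mathcal{T}}\Big[
  \mathcal{L}^{\text{query}}_{\mathcal{T}}\big(
    \theta \;-\; \alpha\,\nabla_{\theta}\,
    \mathcal{L}^{\text{support}}_{\mathcal{T}}(\theta)
  \big)\Big]
```

The outer gradient differentiates through the inner update; with `first_order=True` (as in the code below) the second-derivative terms are dropped, which is cheaper and usually works nearly as well.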

Honest disclaimer

On motor imagery, MAML usually does not beat well-tuned pretraining by much. The mechanism is conceptually important and worth seeing once. The absolute number is not the point.

If this cell does not converge in your session, the slides have the diagram and explanation. Do not panic.

The implementation

We use learn2learn, which gives us maml.clone() and learner.adapt(loss) as one-liners.

import learn2learn as l2l

# Use raw signal (not features) since EEGNet is convolutional
target_subject = 1
source_subjects = [s for s in subjects if s != target_subject]

mask_t = (metadata["subject"] == target_subject).values
X_t = X[mask_t]
y_t = y_int[mask_t]


def sample_episode(subject_id, k_shot=5, q_shot=10, seed=0):
    """Return support and query tensors for one task."""
    rng = np.random.RandomState(seed)
    mask = (metadata["subject"] == subject_id).values
    Xs, ys = X[mask], y_int[mask]
    sup_X, sup_y, qry_X, qry_y = [], [], [], []
    for c in np.unique(ys):
        idx = np.where(ys == c)[0]
        if len(idx) < k_shot + q_shot:
            return None
        chosen = rng.choice(idx, size=k_shot + q_shot, replace=False)
        sup_X.extend(Xs[chosen[:k_shot]])
        sup_y.extend([c] * k_shot)
        qry_X.extend(Xs[chosen[k_shot:]])
        qry_y.extend([c] * q_shot)
    return (
        torch.from_numpy(np.array(sup_X)).to(device),
        torch.tensor(sup_y, dtype=torch.long).to(device),
        torch.from_numpy(np.array(qry_X)).to(device),
        torch.tensor(qry_y, dtype=torch.long).to(device),
    )


# Build a fresh EEGNet
base_model = EEGNetv4(n_chans=n_chans, n_outputs=2, n_times=n_times)
base_model = base_model.to(device)

maml = l2l.algorithms.MAML(base_model, lr=0.01, first_order=True,
                            allow_unused=True, allow_nograd=True)
optimizer = torch.optim.Adam(maml.parameters(), lr=0.001)

n_meta_iterations = 80
inner_steps = 3
tasks_per_batch = 4
losses = []

for iteration in range(n_meta_iterations):
    optimizer.zero_grad()
    meta_loss = 0.0
    n_tasks = 0
    rng_iter = np.random.RandomState(iteration)
    chosen_tasks = rng_iter.choice(source_subjects,
                                   size=tasks_per_batch, replace=False)

    for task_subj in chosen_tasks:
        ep = sample_episode(task_subj, k_shot=5, q_shot=10,
                            seed=iteration * 7 + int(task_subj))
        if ep is None:
            continue
        sup_X, sup_y, qry_X, qry_y = ep
        learner = maml.clone()
        for _ in range(inner_steps):
            sup_logits = learner(sup_X)
            sup_loss = F.cross_entropy(sup_logits, sup_y)
            learner.adapt(sup_loss)
        qry_logits = learner(qry_X)
        qry_loss = F.cross_entropy(qry_logits, qry_y)
        meta_loss = meta_loss + qry_loss
        n_tasks += 1

    if n_tasks > 0:
        meta_loss = meta_loss / n_tasks
        meta_loss.backward()
        optimizer.step()
        losses.append(meta_loss.item())

    if iteration % 20 == 0:
        # Format separately: applying :.4f directly to the string "NA" raises.
        last = f"{losses[-1]:.4f}" if losses else "NA"
        print(f"Meta-iteration {iteration}: meta-loss = {last}")

print("MAML training done.")

Evaluate on the target subject

Adapt the meta-trained initialization on a small calibration set from subject 1, then test on the rest.

def maml_evaluate(maml, X_t, y_t, k_shot=5, inner_steps=5,
                  inner_lr=0.01, seed=0):
    rng = np.random.RandomState(seed)
    classes = np.unique(y_t)
    sup_idx, qry_idx = [], []
    for c in classes:
        idx = np.where(y_t == c)[0]
        chosen = rng.choice(idx, size=k_shot, replace=False)
        sup_idx.extend(chosen)
        qry_idx.extend(np.setdiff1d(idx, chosen))
    sup_idx = np.array(sup_idx); qry_idx = np.array(qry_idx)

    sup_X = torch.from_numpy(X_t[sup_idx]).to(device)
    sup_y = torch.tensor(y_t[sup_idx], dtype=torch.long).to(device)
    qry_X = torch.from_numpy(X_t[qry_idx]).to(device)
    qry_y = torch.tensor(y_t[qry_idx], dtype=torch.long).to(device)

    learner = maml.clone()
    learner.lr = inner_lr  # actually use the inner_lr argument (l2l stores it as .lr)
    for _ in range(inner_steps):
        sup_logits = learner(sup_X)
        sup_loss = F.cross_entropy(sup_logits, sup_y)
        learner.adapt(sup_loss)
    with torch.no_grad():
        qry_pred = learner(qry_X).argmax(dim=1)
    return (qry_pred == qry_y).float().mean().item()


maml_accs = [maml_evaluate(maml, X_t, y_t, k_shot=5, seed=s)
             for s in range(5)]
print(f"MAML evaluation on S{target_subject}, k=5: "
      f"{np.mean(maml_accs):.3f} +/- {np.std(maml_accs):.3f}")

What you should see

If MAML converged: accuracy in the 0.6 to 0.7 range with k=5 shots. That is comparable to fine-tuning a pretrained model with the same budget, sometimes a bit better.

If MAML did not converge (loss flat or rising): the inner learning rate is wrong for this dataset, or the meta-batch is too small. Try inner_lr=0.005. Or accept that the demo did not work today and refer to the slides.

Cog-sci framing

The phrase “learning to learn” comes from Harlow’s 1949 paper on rhesus monkeys. He showed that monkeys solving many similar discrimination problems get faster at solving new ones, eventually succeeding after a single trial. Harlow called this a “learning set”.

MAML is the modern, gradient-friendly formalization of that finding. The hierarchical Bayesian framing in Murphy 19.5 makes the inheritance explicit: the “thing being learned across tasks” is a prior over weights, which functions as the learning set.

It is worth pausing on the historical irony: the field of machine learning rediscovered, in 2017, a phenomenon that the field of cognitive science had named in 1949 and studied for decades. This is not unusual. Most of Chapter 19 has a similar shape.


4. Wrap

We have now used all seven sections of Murphy Chapter 19:

Section  Strategy            Where we used it
19.1     Data augmentation   Block 2, time shift / channel dropout / noise / mixup
19.2     Transfer learning   Blocks 2 and 3, EEGNet pretraining and frozen features
19.3     Semi-supervised     Block 3, pseudo-labeling on subject 1
19.4     Active learning     Block 5, uncertainty vs random acquisition
19.5     Meta-learning       Block 5, small MAML demo
19.6     Few-shot            Block 4, calibration curve
19.7     Weakly supervised   Block 4, label smoothing under noise

Each one was a different way to make a deep model behave a little more like a learner that already had a brain before the data arrived. None of them are deep learning tricks in any deep sense. They are re-inventions, in a different vocabulary, of priors.

That is the cog-sci interpretation of Chapter 19. Whether you find it flattering depends on which department you started in.


5. Discussion

  1. Of the seven strategies we tried today, which one would you reach for first if a colleague handed you their PhD dataset of 30 EEG trials? Why?

  2. Several of the methods we tried (pseudo-labeling, MAML, augmentation) amplify whatever assumptions the model already had. When is that good?

  3. The cognitive analogs we invoked were sometimes loose. Pick one and argue that the analogy is misleading rather than helpful.


6. Block 5 takeaways

  • Active learning is just adaptive experimental design with a loss function.
  • Choosing what to label is often more useful than labeling more.
  • Meta-learning is the formal version of “learning to learn”, which cog sci has been studying since Harlow’s monkeys.
  • Chapter 19 is officially complete.