Block 3. Embeddings as Priors, and Letting the Unlabeled Data Help


Goal. Two more strategies for the few-labels problem, plus a coda:

  • Frozen pretrained features (extension of 19.2): use a pretrained encoder to extract good representations, then put a tiny classifier on top.
  • Semi-supervised learning (19.3): use the unlabeled trials to help.
  • Coda: the same recipe on text (DSM-5 symptom descriptions) using a free open-weight LLM embedding.

Time. About 60 minutes. The first cell that retrains EEGNet takes 2-3 minutes if a saved checkpoint is not found.


0. Setup

%%capture
!pip install -q moabb==1.1.0 mne==1.7.1 braindecode==0.8.1 skorch==1.0.0 \
                pyriemann==0.6 sentence-transformers==3.0.1 umap-learn==0.5.6
import warnings, os, copy
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mne
import torch
import torch.nn as nn

from moabb.datasets import PhysionetMI
from moabb.paradigms import LeftRightImagery

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

from pyriemann.estimation import Covariances
from pyriemann.tangentspace import TangentSpace

from braindecode.models import EEGNetv4
from braindecode import EEGClassifier

mne.set_log_level("WARNING")
np.random.seed(42)
torch.manual_seed(42)

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Setup complete. Device: {device}")

1. Reload data and pretrained model

dataset = PhysionetMI()
paradigm = LeftRightImagery()
subjects = list(range(1, 11))
X, y, metadata = paradigm.get_data(dataset=dataset, subjects=subjects)

le = LabelEncoder()
y_int = le.fit_transform(y)
X = X.astype(np.float32)
n_chans, n_times = X.shape[1], X.shape[2]

print(f"X shape: {X.shape}")

Bring in the pretrained EEGNet from block 2 (or train it if the cache is empty).

PRETRAINED_PATH = "/content/eegnet_pretrained_block2.pt"
target_subject = 1

def make_eegnet(epochs=20, lr=0.001, batch=32):
    return EEGClassifier(
        EEGNetv4,
        module__n_chans=n_chans,
        module__n_outputs=2,
        module__n_times=n_times,
        optimizer=torch.optim.AdamW,
        optimizer__lr=lr,
        optimizer__weight_decay=0.01,
        train_split=None,
        batch_size=batch,
        max_epochs=epochs,
        device=device,
        verbose=0,
    )

source_mask = (metadata["subject"] != target_subject).values

clf_pretrained = make_eegnet(epochs=30)
clf_pretrained.initialize()

if os.path.exists(PRETRAINED_PATH):
    print("Loading pretrained EEGNet from disk")
    clf_pretrained.load_params(f_params=PRETRAINED_PATH)
else:
    print("No checkpoint found, retraining (2-3 minutes)")
    clf_pretrained.fit(X[source_mask], y_int[source_mask])
    clf_pretrained.save_params(f_params=PRETRAINED_PATH)
print("Pretrained model ready.")

2. Three feature extractors compared

The unifying argument of section 19.2 is that the right representation does most of the work. With good features, a tiny classifier on top is enough.

We compare three feature extractors on subject 1:

  Features                             Era    Cost
  Bandpower                            1990s  Trivial
  Riemannian tangent space             2010s  Cheap
  Pretrained EEGNet penultimate layer  2020s  One forward pass

Each one feeds into the same logistic regression. We see which wins on small labeled sets.

2a. Bandpower features (the block 1 baseline)

from scipy.signal import welch

def bandpower_features(Xin, sfreq=160):
    bands = (("alpha", 8, 13), ("beta", 13, 30))
    feats = np.zeros((Xin.shape[0], Xin.shape[1] * len(bands)),
                     dtype=np.float32)
    for t in range(Xin.shape[0]):
        for c in range(Xin.shape[1]):
            f, psd = welch(Xin[t, c], fs=sfreq,
                           nperseg=min(256, Xin.shape[2]))
            for b, (_, lo, hi) in enumerate(bands):
                m = (f >= lo) & (f <= hi)
                feats[t, c * len(bands) + b] = np.log(psd[m].mean() + 1e-12)
    return feats

bp_feats = bandpower_features(X)
print(f"Bandpower features: {bp_feats.shape}")

2b. Riemannian tangent space features

The intuition: each EEG trial gives you a covariance matrix across channels. These matrices live on a curved manifold (the cone of symmetric positive-definite matrices). Project them to a flat tangent space and they become regular vectors that any classifier can eat.

This trick has been embarrassingly competitive on motor imagery for over a decade.
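Under the hood, the projection is just a whitening by a reference covariance followed by a matrix logarithm. A minimal sketch of the map (assuming, for simplicity, the identity as the reference point; pyriemann's TangentSpace instead uses the Riemannian mean of the training covariances):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power, logm

def tangent_vector(C, M):
    """Map an SPD matrix C to the tangent space at reference point M."""
    M_inv_sqrt = fractional_matrix_power(M, -0.5)
    S = logm(M_inv_sqrt @ C @ M_inv_sqrt)   # symmetric matrix, now in a flat space
    iu = np.triu_indices(S.shape[0])
    # Off-diagonal entries are scaled by sqrt(2) so the Euclidean norm of the
    # vector matches the Frobenius norm of S.
    w = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    return np.real(S[iu]) * w

rng = np.random.RandomState(0)
A = rng.randn(3, 8)
C = A @ A.T / 8                      # a toy 3x3 SPD "covariance"
v = tangent_vector(C, np.eye(3))
print(v.shape)                       # (6,) = n*(n+1)/2 features per trial
```

For n channels this yields n(n+1)/2 features per trial, which is the column count you should see in ri_feats.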

cov = Covariances(estimator="oas").fit_transform(X)
ri_feats = TangentSpace().fit_transform(cov).astype(np.float32)
print(f"Riemannian features: {ri_feats.shape}")

2c. Frozen EEGNet penultimate features

Take the pretrained model from block 2. Freeze it. Run all trials through it. Extract the activations of the layer before the final classifier.

We use a forward hook to grab the input to final_layer.

backbone = clf_pretrained.module_
backbone.eval()
backbone.to(device)

penult_buffer = []
def grab_input(module, inp, out):
    penult_buffer.append(inp[0].detach().cpu().numpy())

handle = backbone.final_layer.register_forward_hook(grab_input)

penult_buffer.clear()
with torch.no_grad():
    for i in range(0, len(X), 64):
        batch = torch.from_numpy(X[i:i+64]).to(device)
        _ = backbone(batch)

handle.remove()
en_feats = np.concatenate(
    [f.reshape(f.shape[0], -1) for f in penult_buffer]
).astype(np.float32)
print(f"EEGNet penultimate features: {en_feats.shape}")

2d. Sweep the few-labels learning curve for all three

Same logistic regression on top of each feature set, on subject 1, varying the labeled budget.

mask_subj1 = (metadata["subject"] == target_subject).values

def few_label_curve(features, y_arr, mask, n_list=(5, 10, 20, 40), n_seeds=10):
    out = []
    f_pool = features[mask]
    y_pool = y_arr[mask]
    classes = np.unique(y_pool)
    for n in n_list:
        accs = []
        for seed in range(n_seeds):
            rng = np.random.RandomState(seed)
            # Skip this seed entirely if any class has fewer than n trials;
            # otherwise we would silently train on an incomplete class set.
            if any(n > (y_pool == c).sum() for c in classes):
                accs.append(np.nan)
                continue
            chosen = []
            for c in classes:
                idx = np.where(y_pool == c)[0]
                chosen.extend(rng.choice(idx, size=n, replace=False))
            chosen = np.array(chosen)
            rest = np.setdiff1d(np.arange(len(y_pool)), chosen)
            pipe = make_pipeline(StandardScaler(),
                                 LogisticRegression(max_iter=1000, C=1.0))
            pipe.fit(f_pool[chosen], y_pool[chosen])
            accs.append(pipe.score(f_pool[rest], y_pool[rest]))
        out.append((n, np.nanmean(accs), np.nanstd(accs)))
    return out

curves = {
    "bandpower (1990s)": few_label_curve(bp_feats, y_int, mask_subj1),
    "Riemannian (2010s)": few_label_curve(ri_feats, y_int, mask_subj1),
    "EEGNet pretrained (2020s)": few_label_curve(en_feats, y_int, mask_subj1),
}
fig, ax = plt.subplots(figsize=(8, 5))
for name, curve in curves.items():
    ns = [c[0] for c in curve]
    means = [c[1] for c in curve]
    stds = [c[2] for c in curve]
    ax.errorbar(ns, means, yerr=stds, marker="o", capsize=4, label=name)
ax.axhline(0.5, ls="--", color="k", alpha=0.5, label="chance")
ax.set_xlabel("Trials per class in training set")
ax.set_ylabel("Accuracy on held-out trials of S1")
ax.set_title("19.2: a frozen pretrained representation often wins")
ax.set_ylim(0.4, 1.0)
ax.legend()
plt.tight_layout()
plt.show()

The expected ranking: at very small N, the two frozen-feature methods clearly beat bandpower. At larger N, they converge. Riemannian and EEGNet often trade places depending on the seed.

The cog-sci version of this story: a perceptual system with good prior representations (whether geometric or learned from many brains) needs fewer examples to learn a new distinction. Bayesians have known this since the 1960s.


3. Pseudo-labeling (Murphy 19.3)

So far we have ignored a lot of free data.

Subject 1 has many trials. We pretended only some of them came with labels. But the trials we did not label are still there, and we know their distribution matches the labeled ones (same subject, same task).

Semi-supervised learning is the family of methods that uses these unlabeled trials to help. Murphy 19.3 lists several. The simplest is pseudo-labeling.

The recipe

  1. Train a classifier on the small labeled set.
  2. Predict on the unlabeled pool. Keep predictions you are very confident about.
  3. Add those high-confidence pseudo-labels to your training set. Retrain.

Cognitive analog: this is what every grad student does in their first literature review. You read a few papers carefully (labels), skim many more (unlabeled), trust your judgment about the easy ones, and keep going. If your judgment is good enough, this works. If not, it amplifies your initial bias.

The implementation

def pseudo_label_run(features, y_arr, mask, n_labeled_per_class,
                     n_rounds=4, threshold=0.85, seed=0):
    """
    Returns a DataFrame with one row per round, tracking:
    - n_train: total training set size
    - test_acc_sup: held-out accuracy of supervised baseline
    - test_acc_pseudo: held-out accuracy after this round of pseudo-labeling
    - pseudo_label_acc: accuracy of newly added pseudo-labels (vs hidden truth)
    """
    rng = np.random.RandomState(seed)
    f_pool = features[mask]
    y_pool = y_arr[mask]
    classes = np.unique(y_pool)

    # Held-out clean test set
    test_idx = []
    for c in classes:
        idx = np.where(y_pool == c)[0]
        n_test = len(idx) // 3
        test_idx.extend(rng.choice(idx, size=n_test, replace=False))
    test_idx = np.array(test_idx)
    nontest = np.setdiff1d(np.arange(len(y_pool)), test_idx)

    # Of nontest: pick N as labeled, rest as unlabeled
    labeled_idx = []
    for c in classes:
        idx = np.intersect1d(nontest, np.where(y_pool == c)[0])
        labeled_idx.extend(rng.choice(idx, size=n_labeled_per_class,
                                       replace=False))
    labeled_idx = np.array(labeled_idx)
    pool_idx = np.setdiff1d(nontest, labeled_idx)

    X_test, y_test = f_pool[test_idx], y_pool[test_idx]

    # Supervised-only baseline
    pipe_sup = make_pipeline(StandardScaler(),
                              LogisticRegression(max_iter=1000, C=1.0))
    pipe_sup.fit(f_pool[labeled_idx], y_pool[labeled_idx])
    sup_acc = pipe_sup.score(X_test, y_test)

    # Pseudo-label loop
    train_idx = list(labeled_idx)
    train_y = list(y_pool[labeled_idx])
    pool = list(pool_idx)
    history = [{"round": 0, "n_train": len(train_idx),
                "test_acc_sup": sup_acc, "test_acc_pseudo": sup_acc,
                "pseudo_label_acc": np.nan}]

    for r in range(1, n_rounds + 1):
        if not pool:
            break
        pipe_psd = make_pipeline(StandardScaler(),
                                  LogisticRegression(max_iter=1000, C=1.0))
        pipe_psd.fit(f_pool[train_idx], np.array(train_y))
        proba = pipe_psd.predict_proba(f_pool[pool])
        confident = proba.max(axis=1) >= threshold
        if confident.sum() == 0:
            break
        new_idx = np.array(pool)[confident]
        new_y = pipe_psd.predict(f_pool[new_idx])
        true_y = y_pool[new_idx]
        pl_acc = (new_y == true_y).mean()

        newly_added = set(new_idx)
        train_idx.extend(new_idx)
        train_y.extend(new_y)
        pool = [p for p in pool if p not in newly_added]

        # Refit on the enlarged training set so test_acc reflects
        # this round's pseudo-labels, as the docstring promises.
        pipe_psd.fit(f_pool[train_idx], np.array(train_y))
        test_acc = pipe_psd.score(X_test, y_test)
        history.append({"round": r, "n_train": len(train_idx),
                        "test_acc_sup": sup_acc, "test_acc_pseudo": test_acc,
                        "pseudo_label_acc": pl_acc})
    return pd.DataFrame(history)

Run it on subject 1 with the EEGNet features (the best-calibrated confidences of the three, in our runs) and 10 labeled trials per class.

hist = pseudo_label_run(en_feats, y_int, mask_subj1,
                        n_labeled_per_class=10, n_rounds=4, threshold=0.85)
print(hist.to_string(index=False))

Plot

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

ax = axes[0]
ax.plot(hist["round"], hist["test_acc_sup"], "o--",
        label="supervised baseline (no pseudo-labels)")
ax.plot(hist["round"], hist["test_acc_pseudo"], "s-",
        label="pseudo-labeled")
ax.set_xlabel("Round")
ax.set_ylabel("Accuracy on held-out test set")
ax.set_title("19.3: does pseudo-labeling help?")
ax.legend()
ax.set_ylim(0.5, 0.9)

ax = axes[1]
ax.bar(hist["round"].iloc[1:], hist["pseudo_label_acc"].iloc[1:],
       color="C2", alpha=0.7)
ax.axhline(1.0, ls="--", color="k", alpha=0.3)
ax.set_xlabel("Round")
ax.set_ylabel("Accuracy of new pseudo-labels (vs hidden truth)")
ax.set_title("Are the pseudo-labels right?")
ax.set_ylim(0.0, 1.05)

plt.tight_layout()
plt.show()

What you should see

The pseudo-labeled curve sits a few points above the baseline at small labeled budgets. The pseudo-label accuracy starts very high (we kept only the confident ones) and gradually drops as the model gets bolder and labels harder examples.

If your run shows the pseudo-labeled line below the baseline, the threshold was too low. Try 0.90 instead of 0.85.

The honest caveat

Pseudo-labeling amplifies whatever bias your initial model had. If the labeled set is unrepresentative, the pseudo-labels will be too, and the model gets confidently worse. The literature on semi-supervised learning is, in part, a long argument about how to keep this from happening (consistency regularization, mean teacher, FixMatch, MixMatch). Murphy 19.3 covers them; we are using the simplest member of the family.
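The threshold trade-off is easy to see on synthetic data. A toy sketch (two overlapping 2-D Gaussians, nothing to do with the EEG features; all names and numbers here are illustrative): a stricter threshold keeps fewer pseudo-labels, but the kept ones sit farther from the decision boundary and are more often correct.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
# Two overlapping Gaussian classes in 2-D
X_all = np.vstack([rng.randn(200, 2) + [-1, 0],
                   rng.randn(200, 2) + [+1, 0]])
y_all = np.repeat([0, 1], 200)

def pseudo_round(threshold, n_labeled=5, seed=1):
    """One round of pseudo-labeling; returns (#kept, accuracy of kept labels)."""
    r = np.random.RandomState(seed)
    lab = np.concatenate([r.choice(200, n_labeled, replace=False),
                          200 + r.choice(200, n_labeled, replace=False)])
    pool = np.setdiff1d(np.arange(400), lab)
    clf = LogisticRegression().fit(X_all[lab], y_all[lab])
    proba = clf.predict_proba(X_all[pool])
    keep = proba.max(axis=1) >= threshold
    if keep.sum() == 0:
        return 0, float("nan")
    pred = clf.predict(X_all[pool][keep])
    return int(keep.sum()), float((pred == y_all[pool][keep]).mean())

for thr in (0.85, 0.95):
    n_kept, acc = pseudo_round(thr)
    print(f"threshold={thr}: kept {n_kept} pseudo-labels, {acc:.2f} correct")
```

This is the same dial as the 0.85 vs 0.90 suggestion above: raising the threshold trades coverage of the unlabeled pool for pseudo-label purity.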


4. The text coda: same recipe on DSM-5 symptoms

The point of this 10-minute detour: the recipe (frozen pretrained encoder + tiny classifier on top) is not EEG-specific. Take any modality, plug in the right encoder, and the same pattern works.

We use a 2024+ open-weight LLM embedding model (nomic-embed-text-v1.5), about 137M parameters, fully free, runs on T4 in seconds. We embed short symptom descriptions across four psychiatric categories and watch them cluster.

The data

Ten generic, paraphrased symptom descriptions per category. These are not verbatim DSM text (that would be copyrighted, and also unnecessary for the demo).

symptoms = {
    "Major Depressive Disorder": [
        "I feel sad or empty most of the day",
        "I have lost interest in activities I used to enjoy",
        "I am tired all the time even after resting",
        "I have difficulty concentrating on simple tasks",
        "I feel worthless or guilty without clear reason",
        "I have trouble making everyday decisions",
        "My appetite has changed noticeably",
        "I sleep too much or cannot sleep at all",
        "I think about death frequently",
        "I move and speak more slowly than I used to",
    ],
    "Generalized Anxiety Disorder": [
        "I worry about everything even small things",
        "I cannot control my worrying thoughts",
        "I feel restless and on edge most days",
        "I am tired despite not doing very much",
        "I have difficulty concentrating because of worry",
        "I am irritable with people around me",
        "My muscles feel tense and sore for no reason",
        "I cannot fall asleep because my mind is racing",
        "I feel a constant sense of impending doom",
        "I worry about my health all the time",
    ],
    "ADHD": [
        "I am easily distracted by sounds or movement",
        "I have trouble finishing what I start",
        "I lose important objects regularly",
        "I forget appointments and obligations",
        "I find it hard to sit still during meetings",
        "I interrupt others when they are speaking",
        "I make careless mistakes at work or school",
        "I avoid tasks that require sustained mental effort",
        "I fidget with my hands or feet constantly",
        "I have trouble organizing my daily tasks",
    ],
    "Insomnia Disorder": [
        "I lie awake for hours before falling asleep",
        "I wake up multiple times during the night",
        "I wake up too early and cannot return to sleep",
        "I feel unrefreshed in the morning despite sleeping",
        "I worry about not being able to sleep",
        "I am sleepy during the day from lack of sleep",
        "My sleep is interrupted by physical discomfort",
        "I dread going to bed because of my sleeplessness",
        "I rely on substances to fall asleep",
        "My total sleep time is much less than I need",
    ],
}

texts = []
labels = []
for cat, items in symptoms.items():
    texts.extend(items)
    labels.extend([cat] * len(items))
labels = np.array(labels)
print(f"Total sentences: {len(texts)}, classes: {set(labels)}")

Embed with nomic-embed-text-v1.5

from sentence_transformers import SentenceTransformer

emb_model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5",
                                trust_remote_code=True)
# nomic-embed-text-v1.5 expects a task prefix on each input;
# "clustering:" fits both the UMAP plot and the k-NN classifier below
emb = emb_model.encode(["clustering: " + t for t in texts],
                       show_progress_bar=False)
print(f"Embedding shape: {emb.shape}")

Visualize with UMAP

import umap

reducer = umap.UMAP(n_neighbors=8, min_dist=0.3, random_state=42)
emb_2d = reducer.fit_transform(emb)

fig, ax = plt.subplots(figsize=(8, 6))
for cat in sorted(set(labels)):
    mask = labels == cat
    ax.scatter(emb_2d[mask, 0], emb_2d[mask, 1], label=cat, s=80, alpha=0.8)
ax.set_title("DSM-5 symptoms in nomic-embed space")
ax.set_xlabel("UMAP 1")
ax.set_ylabel("UMAP 2")
ax.legend(loc="best", fontsize=9)
plt.tight_layout()
plt.show()

You should see four reasonably separated clusters. The encoder, given no labels or task-specific supervision from us, already represents these four symptom families as distinct.

The classification: same recipe as the EEG

# Few-shot k-NN on top of the embeddings
from sklearn.metrics import accuracy_score

n_per_class_list = [1, 2, 3, 5]
text_results = []

for n in n_per_class_list:
    accs = []
    for seed in range(20):
        rng = np.random.RandomState(seed)
        train_idx, test_idx = [], []
        for cat in sorted(set(labels)):
            idx = np.where(labels == cat)[0]
            chosen = rng.choice(idx, size=n, replace=False)
            train_idx.extend(chosen)
            test_idx.extend(np.setdiff1d(idx, chosen))
        train_idx = np.array(train_idx); test_idx = np.array(test_idx)
        knn = KNeighborsClassifier(n_neighbors=1)
        knn.fit(emb[train_idx], labels[train_idx])
        accs.append(accuracy_score(labels[test_idx], knn.predict(emb[test_idx])))
    text_results.append((n, np.mean(accs), np.std(accs)))

print(pd.DataFrame(text_results, columns=["n_per_class", "mean_acc", "std_acc"]))

With just 1 example per class (true 1-shot), accuracy on the held-out sentences should already be in the 0.7 to 0.9 range, depending on which sentence happens to be the support example. With 5 per class it should approach 1.0.

This is the same shape as the few-shot calibration curve we will see in block 4 with EEG. Same recipe, very different modalities.

Optional swap: CogText alternative

If you want to swap in Cognitive Atlas task descriptions instead of DSM-5 symptoms (closer to cognitive science, less clinical), replace the symptoms dictionary with categories like:

  • Working memory: descriptions of N-back, digit span, Sternberg, etc.
  • Response inhibition: Stroop, Flanker, Stop-signal, Go/No-go.
  • Attention: Posner cueing, search, vigilance.
  • Decision making: Iowa gambling, two-armed bandit, intertemporal choice.

The rest of the pipeline does not change. The clusters separate just as cleanly, often more cleanly than DSM-5 because the linguistic vocabulary is more distinct.
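A sketch of the swap, with hypothetical paraphrased descriptions (mine, not verbatim Cognitive Atlas text) just to show the shape of the dictionary; extend each list to ~10 items for a fair comparison:

```python
# Hypothetical paraphrases; replace or extend with your own descriptions
cog_tasks = {
    "Working memory": [
        "Hold a stream of letters in mind and respond when one matches two back",
        "Listen to a list of digits and repeat them back in order",
        "Memorize a small set of items and judge whether a probe was in it",
    ],
    "Response inhibition": [
        "Name the ink color of a word while ignoring what the word says",
        "Withhold a prepared button press when a stop signal sounds",
        "Respond to go stimuli and hold back on rare no-go stimuli",
    ],
    "Attention": [
        "Use a spatial cue to shift attention before a target appears",
        "Search a cluttered display for a target among distractors",
        "Monitor a slow stream of stimuli for rare targets over many minutes",
    ],
    "Decision making": [
        "Choose repeatedly between card decks with different payoff schedules",
        "Pick between two options whose reward rates drift over time",
        "Decide between a small reward now and a larger reward later",
    ],
}

symptoms = cog_tasks  # the rest of the pipeline runs unchanged
```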


5. Discussion

  1. The Riemannian features sometimes beat the EEGNet pretrained features on small N. What does that tell you about the value of “deep” representations for EEG, compared to “geometric” representations?

  2. Pseudo-labeling worked here because subject 1’s unlabeled trials came from the same distribution as the labeled ones. What if the unlabeled pool came from a different subject? Would it still help?

  3. The DSM-5 clustering looked clean. Would it look as clean if you used a smaller, older embedding model (e.g., all-MiniLM-L6-v2 from 2021)? What does the difference tell you about how much the encoder matters?


6. Block 3 takeaways

  • A frozen pretrained encoder plus a simple classifier is the strongest first move in 2026.
  • Geometry-aware features (Riemannian) are still hard to beat on small EEG data.
  • Pseudo-labeling lets unlabeled data pull weight, but it amplifies whatever bias you already had.
  • The recipe is universal across modalities. The same idea with nomic-embed works on text.
  • We have now used Murphy 19.1, 19.2, and 19.3. After the next block we will have used 19.6 and 19.7 too.