Block 3. Embeddings as Priors, and Letting the Unlabeled Data Help
Goal. Two more strategies for the few-labels problem, plus a coda:
- Frozen pretrained features (extension of 19.2): use a pretrained encoder to extract good representations, then put a tiny classifier on top.
- Semi-supervised learning (19.3): use the unlabeled trials to help.
- Coda: the same recipe on text (DSM-5 symptom descriptions) using a free open-weight LLM embedding.
Time. About 60 minutes. The first cell that retrains EEGNet takes 2-3 minutes if a saved checkpoint is not found.
0. Setup
%%capture
!pip install -q moabb==1.1.0 mne==1.7.1 braindecode==0.8.1 skorch==1.0.0 \
    pyriemann==0.6 sentence-transformers==3.0.1 umap-learn==0.5.6

import warnings, os, copy
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mne
import torch
import torch.nn as nn
from moabb.datasets import PhysionetMI
from moabb.paradigms import LeftRightImagery
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from pyriemann.estimation import Covariances
from pyriemann.tangentspace import TangentSpace
from braindecode.models import EEGNetv4
from braindecode import EEGClassifier
mne.set_log_level("WARNING")
np.random.seed(42)
torch.manual_seed(42)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Setup complete. Device: {device}")1. Reload data and pretrained model
dataset = PhysionetMI()
paradigm = LeftRightImagery()
subjects = list(range(1, 11))
X, y, metadata = paradigm.get_data(dataset=dataset, subjects=subjects)
le = LabelEncoder()
y_int = le.fit_transform(y)
X = X.astype(np.float32)
n_chans, n_times = X.shape[1], X.shape[2]
print(f"X shape: {X.shape}")Bring in the pretrained EEGNet from block 2 (or train it if the cache is empty).
PRETRAINED_PATH = "/content/eegnet_pretrained_block2.pt"
target_subject = 1
def make_eegnet(epochs=20, lr=0.001, batch=32):
return EEGClassifier(
EEGNetv4,
module__n_chans=n_chans,
module__n_outputs=2,
module__n_times=n_times,
optimizer=torch.optim.AdamW,
optimizer__lr=lr,
optimizer__weight_decay=0.01,
train_split=None,
batch_size=batch,
max_epochs=epochs,
device=device,
verbose=0,
)
source_mask = (metadata["subject"] != target_subject).values
clf_pretrained = make_eegnet(epochs=30)
clf_pretrained.initialize()
if os.path.exists(PRETRAINED_PATH):
print("Loading pretrained EEGNet from disk")
clf_pretrained.load_params(f_params=PRETRAINED_PATH)
else:
print("No checkpoint found, retraining (2-3 minutes)")
clf_pretrained.fit(X[source_mask], y_int[source_mask])
clf_pretrained.save_params(f_params=PRETRAINED_PATH)
print("Pretrained model ready.")2. Three feature extractors compared
The unifying argument of section 19.2 is that the right representation does most of the work. With good features, a tiny classifier on top is enough.
We compare three feature extractors on subject 1:
| Features | Era | Cost |
|---|---|---|
| Bandpower | 1990s | Trivial |
| Riemannian tangent space | 2010s | Cheap |
| Pretrained EEGNet penultimate layer | 2020s | One forward pass |
Each one feeds into the same logistic regression. We see which wins on small labeled sets.
2a. Bandpower features (the block 1 baseline)
from scipy.signal import welch
def bandpower_features(Xin, sfreq=160):
bands = (("alpha", 8, 13), ("beta", 13, 30))
feats = np.zeros((Xin.shape[0], Xin.shape[1] * len(bands)),
dtype=np.float32)
for t in range(Xin.shape[0]):
for c in range(Xin.shape[1]):
f, psd = welch(Xin[t, c], fs=sfreq,
nperseg=min(256, Xin.shape[2]))
for b, (_, lo, hi) in enumerate(bands):
m = (f >= lo) & (f <= hi)
feats[t, c * len(bands) + b] = np.log(psd[m].mean() + 1e-12)
return feats
bp_feats = bandpower_features(X)
print(f"Bandpower features: {bp_feats.shape}")2b. Riemannian tangent space features
The intuition: each EEG trial gives you a covariance matrix across channels. These matrices live on a curved manifold (the cone of symmetric positive-definite matrices). Project them to a flat tangent space and they become regular vectors that any classifier can eat.
This trick has been embarrassingly competitive on motor imagery for over a decade.
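For intuition, here is a minimal sketch of the map that pyriemann's TangentSpace performs under the hood. The real implementation uses the Riemannian mean of all covariances as the reference point and a weighted vectorization; the function name and variables here are ours, for illustration only.

# Illustrative sketch only; use pyriemann's TangentSpace in practice.
def spd_logmap(C, C_ref):
    # Whiten C by the reference point C_ref, then take the matrix log.
    vals, vecs = np.linalg.eigh(C_ref)
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T      # C_ref^(-1/2)
    vals, vecs = np.linalg.eigh(W @ C @ W)          # SPD, near the identity
    S = vecs @ np.diag(np.log(vals)) @ vecs.T       # symmetric, now "flat"
    return S[np.triu_indices_from(S)]               # a plain feature vector

# Usage (hypothetical names): v = spd_logmap(cov_trial, cov_mean)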
cov = Covariances(estimator="oas").fit_transform(X)
ri_feats = TangentSpace().fit_transform(cov).astype(np.float32)
print(f"Riemannian features: {ri_feats.shape}")2c. Frozen EEGNet penultimate features
Take the pretrained model from block 2. Freeze it. Run all trials through it. Extract the activations of the layer before the final classifier.
We use a forward hook to grab the input to final_layer.
backbone = clf_pretrained.module_
backbone.eval()
backbone.to(device)
penult_buffer = []
def grab_input(module, inp, out):
penult_buffer.append(inp[0].detach().cpu().numpy())
handle = backbone.final_layer.register_forward_hook(grab_input)
penult_buffer.clear()
with torch.no_grad():
for i in range(0, len(X), 64):
batch = torch.from_numpy(X[i:i+64]).to(device)
_ = backbone(batch)
handle.remove()
en_feats = np.concatenate(
[f.reshape(f.shape[0], -1) for f in penult_buffer]
).astype(np.float32)
print(f"EEGNet penultimate features: {en_feats.shape}")2d. Sweep the few-labels learning curve for all three
Same logistic regression on top of each feature set, on subject 1, varying the labeled budget.
mask_subj1 = (metadata["subject"] == target_subject).values
def few_label_curve(features, y_arr, mask, n_list=(5, 10, 20, 40), n_seeds=10):
out = []
f_pool = features[mask]
y_pool = y_arr[mask]
classes = np.unique(y_pool)
for n in n_list:
accs = []
for seed in range(n_seeds):
rng = np.random.RandomState(seed)
            chosen = []
            for c in classes:
                idx = np.where(y_pool == c)[0]
                if n > len(idx):
                    chosen = None  # not enough trials of this class for this budget
                    break
                chosen.extend(rng.choice(idx, size=n, replace=False))
            if chosen is None:
                accs.append(np.nan)
                continue
            chosen = np.array(chosen)
rest = np.setdiff1d(np.arange(len(y_pool)), chosen)
pipe = make_pipeline(StandardScaler(),
LogisticRegression(max_iter=1000, C=1.0))
pipe.fit(f_pool[chosen], y_pool[chosen])
accs.append(pipe.score(f_pool[rest], y_pool[rest]))
out.append((n, np.nanmean(accs), np.nanstd(accs)))
return out
curves = {
"bandpower (1990s)": few_label_curve(bp_feats, y_int, mask_subj1),
"Riemannian (2010s)": few_label_curve(ri_feats, y_int, mask_subj1),
"EEGNet pretrained (2020s)": few_label_curve(en_feats, y_int, mask_subj1),
}

fig, ax = plt.subplots(figsize=(8, 5))
for name, curve in curves.items():
ns = [c[0] for c in curve]
means = [c[1] for c in curve]
stds = [c[2] for c in curve]
ax.errorbar(ns, means, yerr=stds, marker="o", capsize=4, label=name)
ax.axhline(0.5, ls="--", color="k", alpha=0.5, label="chance")
ax.set_xlabel("Trials per class in training set")
ax.set_ylabel("Accuracy on held-out trials of S1")
ax.set_title("19.2: a frozen pretrained representation often wins")
ax.set_ylim(0.4, 1.0)
ax.legend()
plt.tight_layout()
plt.show()

The expected ranking: at very small N, the two frozen-feature methods clearly beat bandpower. At larger N, they converge. Riemannian and EEGNet often trade places depending on the seed.
The cog-sci version of this story: a perceptual system with good prior representations (whether geometric or learned from many brains) needs fewer examples to learn a new distinction. Bayesians have known this since the 1960s.
3. Pseudo-labeling (Murphy 19.3)
So far we have ignored a lot of free data.
Subject 1 has many trials. We pretended only some of them came with labels. But the trials we did not label are still there, and we know their distribution matches the labeled ones (same subject, same task).
Semi-supervised learning is the family of methods that uses these unlabeled trials to help. Murphy 19.3 lists several. The simplest is pseudo-labeling.
The recipe
- Train a classifier on the small labeled set.
- Predict on the unlabeled pool. Keep predictions you are very confident about.
- Add those high-confidence pseudo-labels to your training set. Retrain.
Cognitive analog: this is what every grad student does in their first literature review. You read a few papers carefully (labels), skim many more (unlabeled), trust your judgment about the easy ones, and keep going. If your judgment is good enough, this works. If not, it amplifies your initial bias.
The implementation
def pseudo_label_run(features, y_arr, mask, n_labeled_per_class,
n_rounds=4, threshold=0.85, seed=0):
"""
Returns a DataFrame with one row per round, tracking:
- n_train: total training set size
- test_acc_sup: held-out accuracy of supervised baseline
- test_acc_pseudo: held-out accuracy after this round of pseudo-labeling
- pseudo_label_acc: accuracy of newly added pseudo-labels (vs hidden truth)
"""
rng = np.random.RandomState(seed)
f_pool = features[mask]
y_pool = y_arr[mask]
classes = np.unique(y_pool)
# Held-out clean test set
test_idx = []
for c in classes:
idx = np.where(y_pool == c)[0]
n_test = len(idx) // 3
test_idx.extend(rng.choice(idx, size=n_test, replace=False))
test_idx = np.array(test_idx)
nontest = np.setdiff1d(np.arange(len(y_pool)), test_idx)
# Of nontest: pick N as labeled, rest as unlabeled
labeled_idx = []
for c in classes:
idx = np.intersect1d(nontest, np.where(y_pool == c)[0])
labeled_idx.extend(rng.choice(idx, size=n_labeled_per_class,
replace=False))
labeled_idx = np.array(labeled_idx)
pool_idx = np.setdiff1d(nontest, labeled_idx)
X_test, y_test = f_pool[test_idx], y_pool[test_idx]
# Supervised-only baseline
pipe_sup = make_pipeline(StandardScaler(),
LogisticRegression(max_iter=1000, C=1.0))
pipe_sup.fit(f_pool[labeled_idx], y_pool[labeled_idx])
sup_acc = pipe_sup.score(X_test, y_test)
# Pseudo-label loop
train_idx = list(labeled_idx)
train_y = list(y_pool[labeled_idx])
pool = list(pool_idx)
history = [{"round": 0, "n_train": len(train_idx),
"test_acc_sup": sup_acc, "test_acc_pseudo": sup_acc,
"pseudo_label_acc": np.nan}]
for r in range(1, n_rounds + 1):
if not pool:
break
pipe_psd = make_pipeline(StandardScaler(),
LogisticRegression(max_iter=1000, C=1.0))
pipe_psd.fit(f_pool[train_idx], np.array(train_y))
proba = pipe_psd.predict_proba(f_pool[pool])
confident = proba.max(axis=1) >= threshold
if confident.sum() == 0:
break
new_idx = np.array(pool)[confident]
new_y = pipe_psd.predict(f_pool[new_idx])
true_y = y_pool[new_idx]
pl_acc = (new_y == true_y).mean()
train_idx.extend(new_idx)
train_y.extend(new_y)
pool = list(set(pool) - set(new_idx))
test_acc = pipe_psd.score(X_test, y_test)
history.append({"round": r, "n_train": len(train_idx),
"test_acc_sup": sup_acc, "test_acc_pseudo": test_acc,
"pseudo_label_acc": pl_acc})
    return pd.DataFrame(history)

Run it on subject 1 with the EEGNet features (the strongest performer at small budgets above) and 10 labeled trials per class.
hist = pseudo_label_run(en_feats, y_int, mask_subj1,
n_labeled_per_class=10, n_rounds=4, threshold=0.85)
print(hist.to_string(index=False))

Plot
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
ax = axes[0]
ax.plot(hist["round"], hist["test_acc_sup"], "o--",
label="supervised baseline (no pseudo-labels)")
ax.plot(hist["round"], hist["test_acc_pseudo"], "s-",
label="pseudo-labeled")
ax.set_xlabel("Round")
ax.set_ylabel("Accuracy on held-out test set")
ax.set_title("19.3: does pseudo-labeling help?")
ax.legend()
ax.set_ylim(0.5, 0.9)
ax = axes[1]
ax.bar(hist["round"][1:], hist["pseudo_label_acc"][1:], color="C2", alpha=0.7)
ax.axhline(1.0, ls="--", color="k", alpha=0.3)
ax.set_xlabel("Round")
ax.set_ylabel("Accuracy of new pseudo-labels (vs hidden truth)")
ax.set_title("Are the pseudo-labels right?")
ax.set_ylim(0.0, 1.05)
plt.tight_layout()
plt.show()

What you should see
The pseudo-labeled curve sits a few points above the baseline at small labeled budgets. The pseudo-label accuracy starts very high (we kept only the confident ones) and gradually drops as the model gets bolder and labels harder examples.
If your run shows the pseudo-labeled line below the baseline, the threshold was too low. Try 0.90 instead of 0.85.
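For example, rerun with the stricter threshold; everything else stays the same:

# Stricter confidence threshold, same features and label budget as above.
hist_strict = pseudo_label_run(en_feats, y_int, mask_subj1,
                               n_labeled_per_class=10, n_rounds=4,
                               threshold=0.90)
print(hist_strict.to_string(index=False))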
The honest caveat
Pseudo-labeling amplifies whatever bias your initial model had. If the labeled set is unrepresentative, the pseudo-labels will be too, and the model gets confidently worse. The literature on semi-supervised learning is, in part, a long argument about how to keep this from happening (consistency regularization, mean teacher, FixMatch, MixMatch). Murphy 19.3 covers them; we are using the simplest member of the family.
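To make the family concrete, here is a toy sketch of the consistency idea. This is not FixMatch itself (those methods apply data augmentations to raw inputs inside neural-network training); we fake the augmentation with Gaussian noise on the feature vectors, and the noise scale is a made-up knob you would have to tune.

# Toy consistency check: keep a pseudo-label only if the prediction
# survives small perturbations of the input features.
def consistent_mask(pipe, feats, noise=0.1, n_views=5, seed=0):
    rng = np.random.RandomState(seed)
    votes = np.stack([pipe.predict(feats + rng.normal(0, noise, feats.shape))
                      for _ in range(n_views)])
    return (votes == votes[0]).all(axis=0)  # True where all noisy views agree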
4. The text coda: same recipe on DSM-5 symptoms
The point of this 10-minute detour: the recipe (frozen pretrained encoder + tiny classifier on top) is not EEG-specific. Take any modality, plug in the right encoder, and the same pattern works.
We use a 2024+ open-weight LLM embedding model (nomic-embed-text-v1.5), about 137M parameters, fully free, runs on T4 in seconds. We embed short symptom descriptions across four psychiatric categories and watch them cluster.
The data
Ten generic, paraphrased symptom descriptions per category. Not verbatim DSM text (that would be copyrighted and also unnecessary for the demo).
symptoms = {
"Major Depressive Disorder": [
"I feel sad or empty most of the day",
"I have lost interest in activities I used to enjoy",
"I am tired all the time even after resting",
"I have difficulty concentrating on simple tasks",
"I feel worthless or guilty without clear reason",
"I have trouble making everyday decisions",
"My appetite has changed noticeably",
"I sleep too much or cannot sleep at all",
"I think about death frequently",
"I move and speak more slowly than I used to",
],
"Generalized Anxiety Disorder": [
"I worry about everything even small things",
"I cannot control my worrying thoughts",
"I feel restless and on edge most days",
"I am tired despite not doing very much",
"I have difficulty concentrating because of worry",
"I am irritable with people around me",
"My muscles feel tense and sore for no reason",
"I cannot fall asleep because my mind is racing",
"I feel a constant sense of impending doom",
"I worry about my health all the time",
],
"ADHD": [
"I am easily distracted by sounds or movement",
"I have trouble finishing what I start",
"I lose important objects regularly",
"I forget appointments and obligations",
"I find it hard to sit still during meetings",
"I interrupt others when they are speaking",
"I make careless mistakes at work or school",
"I avoid tasks that require sustained mental effort",
"I fidget with my hands or feet constantly",
"I have trouble organizing my daily tasks",
],
"Insomnia Disorder": [
"I lie awake for hours before falling asleep",
"I wake up multiple times during the night",
"I wake up too early and cannot return to sleep",
"I feel unrefreshed in the morning despite sleeping",
"I worry about not being able to sleep",
"I am sleepy during the day from lack of sleep",
"My sleep is interrupted by physical discomfort",
"I dread going to bed because of my sleeplessness",
"I rely on substances to fall asleep",
"My total sleep time is much less than I need",
],
}
texts = []
labels = []
for cat, items in symptoms.items():
texts.extend(items)
labels.extend([cat] * len(items))
labels = np.array(labels)
print(f"Total sentences: {len(texts)}, classes: {set(labels)}")Embed with nomic-embed-text-v1.5
from sentence_transformers import SentenceTransformer
emb_model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5",
trust_remote_code=True)
emb = emb_model.encode(texts, show_progress_bar=False)
print(f"Embedding shape: {emb.shape}")Visualize with UMAP
import umap
reducer = umap.UMAP(n_neighbors=8, min_dist=0.3, random_state=42)
emb_2d = reducer.fit_transform(emb)
fig, ax = plt.subplots(figsize=(8, 6))
for cat in sorted(set(labels)):
mask = labels == cat
ax.scatter(emb_2d[mask, 0], emb_2d[mask, 1], label=cat, s=80, alpha=0.8)
ax.set_title("DSM-5 symptoms in nomic-embed space")
ax.set_xlabel("UMAP 1")
ax.set_ylabel("UMAP 2")
ax.legend(loc="best", fontsize=9)
plt.tight_layout()
plt.show()

You should see four reasonably separated clusters. The model has learned, with no supervision from us, that these four symptom families are distinct.
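If you want a number rather than a picture, a quick optional check is the silhouette score of the raw embeddings against the category labels; a value well above 0 means the clusters are real, not a UMAP artifact:

# Quantify cluster separation in the raw (pre-UMAP) embedding space.
from sklearn.metrics import silhouette_score
print(f"Silhouette score: {silhouette_score(emb, labels):.2f}")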
The classification: same recipe as the EEG
# Few-shot k-NN on top of the embeddings
from sklearn.metrics import accuracy_score
n_per_class_list = [1, 2, 3, 5]
text_results = []
for n in n_per_class_list:
accs = []
for seed in range(20):
rng = np.random.RandomState(seed)
train_idx, test_idx = [], []
for cat in sorted(set(labels)):
idx = np.where(labels == cat)[0]
chosen = rng.choice(idx, size=n, replace=False)
train_idx.extend(chosen)
test_idx.extend(np.setdiff1d(idx, chosen))
train_idx = np.array(train_idx); test_idx = np.array(test_idx)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(emb[train_idx], labels[train_idx])
accs.append(accuracy_score(labels[test_idx], knn.predict(emb[test_idx])))
text_results.append((n, np.mean(accs), np.std(accs)))
print(pd.DataFrame(text_results, columns=["n_per_class", "mean_acc", "std_acc"]))

With just 1 example per class (true 1-shot), accuracy on the held-out sentences should already be in the 0.7 to 0.9 range, depending on which sentence happens to be the support example. With 5 per class it should approach 1.0.
This is the same shape as the few-shot calibration curve we will see in block 4 with EEG. Same recipe, very different modalities.
Optional swap: CogText alternative
If you want to swap in Cognitive Atlas task descriptions instead of DSM-5 symptoms (closer to cognitive science, less clinical), replace the symptoms dictionary with categories like the ones below (a minimal stub is sketched after the list):
- Working memory: descriptions of N-back, digit span, Sternberg, etc.
- Response inhibition: Stroop, Flanker, Stop-signal, Go/No-go.
- Attention: Posner cueing, search, vigilance.
- Decision making: Iowa gambling, two-armed bandit, intertemporal choice.
The rest of the pipeline does not change. The clusters separate just as cleanly, often more cleanly than DSM-5 because the linguistic vocabulary is more distinct.
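A minimal stub of that swap. The sentences below are illustrative paraphrases written for this sketch, not Cognitive Atlas text; expand each list to about 10 items as above.

# Hypothetical replacement dictionary; expand each list before running.
symptoms = {
    "Working memory": [
        "I hold a string of digits in mind and repeat it backwards",
        "I decide whether the current letter matches the one two trials back",
    ],
    "Response inhibition": [
        "I name the ink color while ignoring the written word",
        "I cancel my response whenever the stop signal sounds",
    ],
    "Attention": [
        "I respond faster to targets at the location the cue pointed to",
        "I monitor a slow stream of stimuli for rare targets",
    ],
    "Decision making": [
        "I choose between card decks whose payoffs I learn by experience",
        "I weigh a small reward now against a larger reward later",
    ],
}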
5. Discussion
The Riemannian features sometimes beat the EEGNet pretrained features on small N. What does that tell you about the value of “deep” representations for EEG, compared to “geometric” representations?
Pseudo-labeling worked here because subject 1’s unlabeled trials came from the same distribution as the labeled ones. What if the unlabeled pool came from a different subject? Would it still help?
The DSM-5 clustering looked clean. Would it look as clean if you used a smaller, older embedding model (e.g., all-MiniLM-L6-v2 from 2021)? What does the difference tell you about how much the encoder matters?
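If you actually try it, the swap is two lines (all-MiniLM-L6-v2 is a real sentence-transformers checkpoint; rerunning the k-NN sweep on emb_small is left to you):

# Swap in the 2021-era encoder and re-embed the same sentences.
small_model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings
emb_small = small_model.encode(texts, show_progress_bar=False)
print(f"MiniLM embeddings: {emb_small.shape}")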
6. Block 3 takeaways
- A frozen pretrained encoder plus a simple classifier is the strongest first move in 2026.
- Geometry-aware features (Riemannian) are still hard to beat on small EEG data.
- Pseudo-labeling lets unlabeled data pull weight, but it amplifies whatever bias you already had.
- The recipe is universal across modalities. The same idea with nomic-embed works on text.
- We have now used Murphy 19.1, 19.2, and 19.3. After the next block we will have used 19.6 and 19.7 too.