Brain Path Specification

Universal address for multimodal cognitive data

This document specifies the structure, semantics, and usage of the brain path.

What is a brain path?

A brain path is a canonical address for human brain data. The goal is a query schema that enables consistent, dynamic referencing across diverse modalities and datasets, and that ultimately supports a large data lakehouse for human brain data that can be queried and accessed in a standard way.

A brain path has two layers:

  • Raw layer (native URIs): the original data, addressed by its own native URI (https:, s3:, file:). Raw data has no canonical structure: it keeps the provider's layout and can be in any format and organization. The raw layer is the source of truth for provenance, and it is what you get when you read .raw on API objects.

  • Canonical layer (brain:// scheme): the logical identity of the data, organized by a controlled vocabulary and universal subject identifiers. The canonical layer is independent of where the bytes physically live. It provides a consistent way to reference data across datasets and modalities.

The query planner maps each brain:// path to its raw source(s) and the processing steps needed to derive the requested representation. It does this as an explicit staged pipeline (parse → normalize → resolve → expand → match → plan → bind → slice → return); see Query planning. The canonical layer is the primary interface for users and applications, while the raw layer is the source of truth for data access and provenance.

A canonical address has the form

brain://{catalog?}/{subjects}/:modality/:space/:dtype/{:qualifiers}/@coords

The {catalog} is the namespace that resolves the path, and it always occupies the authority position of the URI. In the bare-local form the authority is empty (brain:///...), which selects the default local catalog. To resolve the same address from a named or remote catalog, pair the scheme with a transport and put the catalog in the authority:

brain+https://{catalog}/{subjects}/:modality/:space/:dtype/{:qualifiers}/@coords
brain+s3://{catalog}/{subjects}/:modality/:space/:dtype/{:qualifiers}/@coords
brain+file://{catalog}/{subjects}/:modality/:space/:dtype/{:qualifiers}/@coords

This pairs a logical scheme (brain) with a concrete transport (e.g. https). The path after the authority is identical in every form, and {subjects} is always its first segment. See Catalog and subject.

Examples:

  • Raw (native URI): https://openneuro.org/datasets/ds002158/snapshots/1.0.2/files/sub-102/...
  • Canonical (local): brain:///hcp-100307/:eeg/:native/:voltage/:rest/@ch=Cz
  • Canonical (all subjects, local): brain:///*/:fmri/:mni152/:bold/:rest/@*
  • Canonical (remote catalog): brain+https://omnibrain.org/hcp-100307/:fmri/:mni152/:bold/:rest/@*

Scheme and transport

Form | Resolves against
--- | ---
brain:///... | the default local catalog (empty authority)
brain+https://{catalog}/... | the named catalog over HTTPS
brain+s3://{catalog}/... | the named catalog over S3
brain+file://{catalog}/... | the named catalog over a local file store

The catalog is always the authority; the transport only says how to reach it. The same canonical content is reachable over several transports without changing its identity.
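As a minimal sketch of the table above, the logical scheme, the transport, and the catalog can be separated with Python's standard URI splitter. The function name split_scheme is hypothetical and not part of any brain path API.

```python
from typing import Optional
from urllib.parse import urlsplit


def split_scheme(address: str) -> tuple[str, Optional[str], str]:
    """Return (logical scheme, transport or None, catalog) for a brain path."""
    parts = urlsplit(address)
    # "brain+https" -> ("brain", "https"); bare "brain" -> ("brain", "")
    logical, _, transport = parts.scheme.partition("+")
    if logical != "brain":
        raise ValueError(f"not a canonical brain path: {address!r}")
    # An empty authority (brain:///...) selects the default local catalog.
    return logical, (transport or None), parts.netloc


print(split_scheme("brain:///hcp-100307/:fmri/:mni152/:bold/:rest/@*"))
print(split_scheme("brain+s3://omni-federation/hcp-100307/:fmri/:mni152/:bold/:rest/@*"))
```

The path after the authority is left untouched, which reflects the rule that the transport changes how the catalog is reached but not the identity of the content.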

Raw data

Raw bytes keep the provider’s own locator, stored as a native URI:

https://openneuro.org/crn/datasets/ds002158/snapshots/1.0.2/files/...
s3://openneuro.org/ds002158/sub-102/...
file:///mnt/xcit-h2/ABIDE2-RawData/sub-102/...
  • A bare local path (/mnt/...) is normalized to file:///mnt/... so it is a valid URI.
  • An optional raw+https://... tag self-types a raw locator when it travels outside the catalog. Inside the catalog the namespace is already known, so raw is stored as its plain native URI.
  • Raw paths intentionally preserve the data provider layout. They do not enforce a vocabulary or specific structure.
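The normalization rules above could be sketched as follows. The function name normalize_raw is hypothetical, and stripping the raw+ tag when a locator enters the catalog is an assumption drawn from the second bullet, not a confirmed behavior.

```python
from urllib.parse import quote


def normalize_raw(locator: str) -> str:
    """Return a raw locator as a plain native URI."""
    # Assumption: the optional self-typing tag is dropped inside the catalog,
    # so raw+https://... is stored as https://...
    if locator.startswith("raw+"):
        locator = locator[len("raw+"):]
    # Already a URI with an explicit scheme: keep the provider layout as-is.
    if "://" in locator:
        return locator
    # Bare absolute local path: normalize to a file: URI (percent-encoding
    # any characters that are not valid in a URI path).
    if locator.startswith("/"):
        return "file://" + quote(locator)
    raise ValueError(f"cannot normalize raw locator: {locator!r}")


print(normalize_raw("/mnt/xcit-h2/ABIDE2-RawData/sub-102"))
```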

Canonical structure

The canonical address is organized by universal subject identifiers:

brain://{catalog?}/{subjects}/:modality/:space/:dtype/{:qualifiers}/@coords

where:

  • {catalog?}: the resolving namespace, in the authority position; empty (brain:///...) selects the default local catalog. See Catalog and subject.

  • {subjects}: universal subject identifier (includes the dataset prefix). See Subject identifiers. This can be a single subject (hcp-100307), a comma-separated list (hcp-100307,hcp-100408), or * for every subject in the catalog.

  • the segments following {subjects} are controlled vocabulary terms that describe the data. They are separated by / and must be prefixed with : to indicate they are from the canonical vocabulary. Vocabulary terms are case-insensitive and normalized to lowercase (:MNI152 and :mni152 denote the same term). The required segments are:

    • :modality: modality term, e.g., :fmri, :eeg
    • :space: reference/registration space term, e.g., :mni152, :native
    • :dtype: data type, e.g., :bold, :voltage

And the optional segments are:

  • {:qualifiers}: optional list of qualifiers, e.g., :denoised, :rest, :task
  • @coords: coordinate selector (use @* to request all), e.g., @xyz=32,45,12;t=0:1200 for a voxel time series or @ch=Cz for an EEG channel
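To make the segment layout concrete, here is a minimal parsing sketch for fully resolved addresses. BrainPath and parse_brain_path are hypothetical names, and the sketch deliberately rejects ! (unresolved) segments and does not check terms against a vocabulary.

```python
from dataclasses import dataclass, field
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class BrainPath:
    transport: Optional[str]   # None for the bare brain:/// form
    catalog: str               # "" means the default local catalog
    subjects: list[str]
    modality: str
    space: str
    dtype: str
    qualifiers: list[str] = field(default_factory=list)
    coords: str = "*"          # selector without the leading @


def parse_brain_path(address: str) -> BrainPath:
    parts = urlsplit(address)
    logical, _, transport = parts.scheme.partition("+")
    if logical != "brain":
        raise ValueError("scheme must be brain or brain+{transport}")
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        raise ValueError("missing {subjects} segment")
    subjects = segments[0].split(",")
    rest = segments[1:]
    coords = "*"  # a missing @coords defaults to @*
    if rest and rest[-1].startswith("@"):
        coords = rest.pop()[1:]
    if len(rest) < 3 or not all(s.startswith(":") for s in rest):
        raise ValueError("expected :modality/:space/:dtype and optional :qualifiers")
    # Vocabulary terms are case-insensitive and normalized to lowercase.
    terms = [s[1:].lower() for s in rest]
    return BrainPath(transport or None, parts.netloc, subjects,
                     terms[0], terms[1], terms[2], terms[3:], coords)


print(parse_brain_path("brain:///hcp-100307/:EEG/:native/:voltage/:rest/@ch=Cz"))
```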

Catalog and subject

The catalog and the subject answer two different questions:

  • {catalog} is where to resolve the path — the namespace/host that holds the index and the bytes. It is always the URI authority: empty for the default local catalog, or named for a curated/remote collection (brain+https://omnibrain.org/...). A catalog is an orthogonal grouping; it does not identify provenance.
  • {subjects} is what data — the dataset-prefixed universal id (hcp-100307). The dataset prefix inside the id (hcp) is what uniquely identifies the source dataset and provenance, regardless of which catalog serves it.

So the same subject can live in more than one catalog, and a single catalog can serve many datasets; the subject id stays stable across both.

The brain path uses these special elements:

Element | Meaning | Notes
--- | --- | ---
brain: | Canonical scheme | Derived/processed data. The address is a valid URI.
+{transport} | Transport for a named/remote catalog | brain+https://, brain+s3://, brain+file://. Bare brain:/// is the default local catalog.
//{catalog}/ | Catalog (authority) | Namespace/host that resolves the path; empty (///) means the default local catalog.
: | Controlled vocabulary term | Segment must be valid in the canonical vocabulary (or resolvable to it).
! | Unresolved term | Placeholder for unknowns; queryable and later resolvable during ingestion. Replaces ~ (which collided with the shell/OS home-directory sigil) and ? (reserved as the URI query delimiter).
@ | Coordinate selector | Explicit selector for spatial/time/stream coordinates. See Coordinates.

Path segments

Segment | Required | Type | Description
--- | --- | --- | ---
{catalog} | optional | authority | Resolving namespace/host; empty selects the default local catalog
{subjects} | required | string | Canonical subject ID (e.g., hcp-100307)
:modality | required | vocab | Modality (e.g., :fmri, :t1w, :eeg)
:space | required | vocab | Space (e.g., :mni152, :native)
:dtype | required | vocab | Representation (e.g., :bold, :intensity, :voltage)
{:qualifiers} | optional | vocab list | Additional canonical qualifiers (task, processing, etc.)
@coords | optional | selector | Coordinate/stream selector (defaults to @*)

Subject identifiers

A deterministic canonical subject id is preferred to enable consistent referencing across datasets:

"{dataset_prefix}-{clean_id}"
  • {dataset_prefix}: canonical dataset code (e.g., hcp)
  • {clean_id}: dataset subject identifier normalized into a stable form

Example: hcp-100307
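The specification does not say how {clean_id} is normalized, so the following is only an illustrative sketch. The function name make_subject_id and both cleaning rules (dropping a BIDS-style sub- prefix, keeping only lowercase letters and digits) are assumptions.

```python
import re


def make_subject_id(dataset_prefix: str, raw_id: str) -> str:
    """Build "{dataset_prefix}-{clean_id}" from a dataset's own subject label."""
    clean = raw_id.strip().lower()
    # Assumption: drop a BIDS-style "sub-" prefix if present.
    clean = re.sub(r"^sub-", "", clean)
    # Assumption: keep only letters and digits so the id is stable.
    clean = re.sub(r"[^a-z0-9]", "", clean)
    if not clean:
        raise ValueError(f"empty subject id from {raw_id!r}")
    return f"{dataset_prefix.lower()}-{clean}"


print(make_subject_id("hcp", "100307"))      # hcp-100307
print(make_subject_id("HCP", "sub-100307"))  # hcp-100307
```

Whatever rule is chosen, it needs to be deterministic: the same dataset label must always map to the same canonical id.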

Qualifiers

Qualifiers are optional additional segments that provide more specific information about the data.

brain:///{subjects}/:modality/:space/:dtype/:qual1/:qual2/.../@coords

Typical qualifier families:

  • acquisition/condition: :rest, :task, :eyes-open, :eyes-closed
  • processing: :denoised, :filtered, :source-localized
  • feature forms: :parcellated, :roi-mean, :embedding
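Going the other way, a canonical address with qualifiers can be assembled from its parts. This is a sketch with a hypothetical helper name (build_brain_path); the check that a named catalog and a transport must appear together is one reading of the scheme rules, not a stated requirement.

```python
from typing import Optional, Sequence


def build_brain_path(subjects: Sequence[str], modality: str, space: str, dtype: str,
                     qualifiers: Sequence[str] = (), coords: str = "*",
                     catalog: str = "", transport: Optional[str] = None) -> str:
    # Assumption: a named catalog always travels with a transport, and the
    # bare brain:/// form always has an empty authority.
    if bool(catalog) != bool(transport):
        raise ValueError("a named catalog needs a transport, and vice versa")
    scheme = f"brain+{transport}" if transport else "brain"
    terms = [modality, space, dtype, *qualifiers]
    segments = [",".join(subjects)] + [f":{t.lower()}" for t in terms] + [f"@{coords}"]
    return f"{scheme}://{catalog}/" + "/".join(segments)


print(build_brain_path(["hcp-100307"], "fmri", "mni152", "bold",
                       qualifiers=["rest", "denoised"],
                       coords="xyz=-42,38,12;t=0:1200"))
```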

Coordinates

Coordinates represent spatial, temporal, and stream indexing. They are expressed using @... syntax at the end of the path. The selector uses name=value keys in the style of the W3C Media Fragments URI convention (which defines t for time ranges); xyz and ch are keys specific to brain paths, and the ; separator keeps a multi-axis selector inside a single path segment. The interpretation of the coordinates depends on the modality, space, and dtype.

The units of @xyz are fixed by :space: a standard space (e.g. :mni152) implies millimetres in that space’s reference frame, while :native implies the native voxel index of the subject’s own grid. t indexes time in samples/volumes (e.g. fMRI volume index, EEG sample).

Form | Meaning
--- | ---
@* | entire data
@xyz=-42,38,12 | spatial point. Units defined by :space (:mni152 ⇒ mm; :native ⇒ voxel index).
@xyz=-42,38,12;t=0:1200 | spatial point + time range. t indexes time (fMRI volume index, EEG sample).
@xyz=-42:40,30:50,10:20 | spatial bounding box (each axis given as lo:hi).
@t=0:1200 | time range only (e.g., a whole-brain time series).
@ch=Cz | named stream selector (channel, parcel, or variable). Modality-dependent; maps to a named axis.
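The selector forms in the table can be read with a small parser. This is a sketch under stated assumptions: parse_coords is a hypothetical name, numeric components are taken to be integers, and the sketch does not decide whether the hi end of lo:hi is inclusive, since the table does not say.

```python
def _parse_component(text: str):
    """Parse one axis component: "lo:hi" -> (lo, hi), "12" -> 12, "Cz" -> "Cz"."""
    if ":" in text:
        lo, hi = text.split(":", 1)
        return (int(lo), int(hi))
    try:
        return int(text)
    except ValueError:
        return text  # a named stream such as an EEG channel


def parse_coords(selector: str) -> dict:
    """Parse "@key=v1,v2;key2=lo:hi" into {key: [components]}; "@*" gives {}."""
    if not selector.startswith("@"):
        raise ValueError("selector must start with @")
    body = selector[1:]
    if body == "*":
        return {}  # no restriction: the entire data
    axes = {}
    for part in body.split(";"):
        key, sep, value = part.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"malformed selector part: {part!r}")
        axes[key] = [_parse_component(c) for c in value.split(",")]
    return axes


print(parse_coords("@xyz=-42,38,12;t=0:1200"))
print(parse_coords("@ch=Cz"))
```

Units are intentionally absent from the parsed result; they are fixed by :space and would be attached later, when the selector is validated against the rest of the path.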

Canonical vs Raw

Aspect Canonical (brain:) Raw (native URI)
Scheme brain: / brain+{transport}: provider-native (https:, s3:, file:)
Structure fixed schema dataset-defined
Subject ID deterministic universal ID dataset convention
Vocabulary enforced none
Coordinates explicit @... selector none (native implied)
Role logical identity, location-independent physical bytes, where the provider put them

Examples

Canonical paths (local catalog):

brain:///hcp-100307/:fmri/:mni152/:bold/:rest/@*
brain:///hcp-100307/:fmri/:mni152/:bold/:rest/:denoised/@xyz=-42,38,12;t=0:1200
brain:///hcp-100307/:t1w/:mni152/:intensity/@*
brain:///hcp-100307/:eeg/:native/:voltage/:rest/@ch=Cz
brain:///hcp-100307/:multimodal/:mni152/:embedding/:rest/@*
brain:///*/:fmri/:mni152/:bold/:rest/@*

The same content resolved from a remote catalog, over different transports:

brain+https://omnirest.xcit.org/hcp-100307/:fmri/:mni152/:bold/:rest/@*
brain+s3://omni-federation/hcp-100307/:fmri/:mni152/:bold/:rest/@*

The raw sources the hcp-100307 canonical paths resolve to (native URIs from the HCP dataset):

https://db.humanconnectome.org/data/projects/HCP_1200/subjects/100307/...
s3://hcp-openaccess/HCP_1200/100307/...
file:///mnt/xcit-h2/HCP-RawData/100307/...

Query patterns (API examples)

Assume an API with:

  • dataset.query(pattern: str) -> list[path]
  • dataset.get(path: str) -> object
  • object.raw

All resting fMRI in standard space (local)

dataset.query("brain:///*/:fmri/:mni152/:bold/:rest/@*")

All resting fMRI in a remote catalog

dataset.query("brain+https://omnirest.xcit.org/*/:fmri/:mni152/:bold/:rest/@*")

Specific voxel time series

dataset.get("brain:///hcp-100307/:fmri/:mni152/:bold/:rest/@xyz=-42,38,12;t=0:1200")

Trace back provenance to the raw source

dataset.get("brain:///hcp-100307/:fmri/:mni152/:bold/:rest/@*").raw
# -> "file:///mnt/xcit-h2/HCP-RawData/100307/..." (a native URI)

Query planning

A brain:// address is not resolved in one step. The query planner breaks it into an explicit pipeline of nine stages, each with a typed input and output. dataset.query() runs stages 1–8 and returns lazy path handles; dataset.get() runs the same stages for a single path and then executes the plan to return the object.

flowchart TD
  A["1. Parse<br/>address → AST"] --> B["2. Normalize<br/>AST → canonical AST"]
  B --> C["3. Resolve vocabulary<br/>bind :terms; mark ! and *"]
  C --> D["4. Expand<br/>wildcards/lists → candidate paths"]
  D --> E["5. Match catalog<br/>artifact vs. raw + recipe"]
  E --> F["6. Plan derivation<br/>raw → steps → canonical DAG"]
  F --> G["7. Bind raw sources<br/>attach native URIs (.raw)"]
  G --> H["8. Apply coordinates<br/>push @coords as lazy slice"]
  H --> I["9. Return / materialize<br/>query: handles · get: object"]

  1. Parse: lex the address into typed components {scheme, transport, catalog, subjects[], modality, space, dtype, qualifiers[], coords}. Reject a literal ? or #. Output: an AST.
  2. Normalize: lowercase vocabulary terms, canonicalize subject ids, normalize bare/file: raw paths, and default a missing @coords to @*. Output: a canonical AST.
  3. Resolve vocabulary: bind each :term to the controlled vocabulary; leave !term (unresolved) and * (wildcard) as open holes. Output: AST annotated resolved | unresolved | wildcard per segment.
  4. Expand: turn wildcards, multi-subject lists, and catalog scope into a concrete candidate set by matching against the catalog index (the datasets.yml-style inventory). Output: a list of fully ground candidate canonical paths.
  5. Match catalog: for each candidate, look up whether a materialized derivative already exists (e.g. an fMRIPrep output), whether a partial derivative exists (preprocessed but not yet denoised) that can seed the plan, or whether there are only raw bytes plus a derivation recipe. Reusing an existing derivative instead of recomputing is a cache hit, the same reuse a SQL optimizer does with a materialized view. Output: (path, status) pairs, where status is derivative | partial | recipe.
  6. Plan derivation: for non-materialized paths, assemble the processing DAG (raw -> steps -> canonical) from a registry of typed transforms. Each transform declares its precondition (the representation it consumes), effect (what it produces), cost, and implementation. The planner selects the transforms whose effects reach the requested :space / :dtype / :qualifiers and orders them by their preconditions (e.g. registration to :mni152 precedes denoising, because denoising consumes data already in its analysis space), rather than following a hand-written per-modality script. Step order is therefore derived, not fixed. Output: a per-path plan DAG.
  7. Bind raw sources: attach the native URIs each plan reads from; this is exactly what .raw returns. Output: plan + provenance.
  8. Apply coordinates: push the @coords selector down as a lazy slice on the resolved artifact (voxel/surface/stream + time), validated against :modality / :space / :dtype. Output: a sliced lazy plan.
  9. Return / materialize: query() returns the list of resolved path handles (lazy plans); get() executes one plan and returns the object.
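Stage 6 is the least obvious step, so here is a minimal sketch of deriving step order from declared preconditions and effects rather than from a fixed script. Everything in it is an assumption for illustration: the Transform fields, the fact strings such as "space:native", the costs, and the use of a uniform-cost search. Transform implementations are omitted.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Transform:
    name: str
    requires: frozenset   # precondition: facts that must hold before the step
    removes: frozenset    # facts the step invalidates
    adds: frozenset       # effect: facts the step establishes
    cost: float


def plan(start: frozenset, goal: frozenset, registry: list) -> list:
    """Cheapest sequence of transform names from start to a state containing goal."""
    frontier = [(0.0, 0, start, [])]
    seen = set()
    counter = 1  # tie-breaker so the heap never has to compare states
    while frontier:
        cost, _, state, steps = heapq.heappop(frontier)
        if goal <= state:
            return steps
        if state in seen:
            continue
        seen.add(state)
        for t in registry:
            # Applicable if its precondition holds and it would add something new.
            if t.requires <= state and not t.adds <= state:
                nxt = (state - t.removes) | t.adds
                heapq.heappush(frontier, (cost + t.cost, counter, nxt, steps + [t.name]))
                counter += 1
    raise LookupError("no derivation reaches the requested representation")


# Hypothetical registry with two transforms and made-up costs.
REGISTRY = [
    Transform("denoise", frozenset({"space:mni152", "dtype:bold"}),
              frozenset(), frozenset({"qual:denoised"}), 3.0),
    Transform("register_to_mni152", frozenset({"space:native"}),
              frozenset({"space:native"}), frozenset({"space:mni152"}), 5.0),
]

raw = frozenset({"space:native", "dtype:bold", "qual:rest"})
goal = frozenset({"space:mni152", "dtype:bold", "qual:rest", "qual:denoised"})
print(plan(raw, goal, REGISTRY))  # ['register_to_mni152', 'denoise']
```

Registration comes first even though denoise is listed first in the registry, because denoise's precondition is not met until the data is in :mni152. Starting instead from a partial derivative that is already in :mni152 yields only the denoise step, which corresponds to the stage-5 cache hit.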

Worked trace

Take brain:///hcp-100307/:fmri/:mni152/:bold/:rest/:denoised/@xyz=-42,38,12;t=0:1200:

  1. Parse → {catalog: <local>, subjects: [hcp-100307], modality: fmri, space: mni152, dtype: bold, qualifiers: [rest, denoised], coords: xyz=-42,38,12;t=0:1200}.
  2. Normalize → terms already lowercase; coords kept; subject id already canonical.
  3. Resolve vocabulary → every :term binds; no holes.
  4. Expand → a single concrete subject, so one candidate path.
  5. Match catalog → :denoised is not materialized; the catalog returns the raw :bold source plus a denoising recipe.
  6. Plan derivation → DAG: raw bold → register to :mni152 → denoise.
  7. Bind raw sources → .raw = file:///mnt/xcit-h2/HCP-RawData/100307/....
  8. Apply coordinates → slice voxel (-42, 38, 12) mm (mm because :space = :mni152) over volumes 0:1200.
  9. Return → get() executes the DAG and returns the sliced time series.

Where queries diverge at stages 3–4:

  • A wildcard query (brain:///*/:fmri/:mni152/:bold/:rest/@*) leaves subjects as * at stage 3, so stage 4 expands it into every matching subject in the catalog — many candidate paths instead of one.
  • An unresolved query (brain:///*/!weirdmodality) leaves the !weirdmodality segment as an open hole at stage 3; stage 4 matches only catalog entries that carry that not-yet-mapped term, which is how unresolved terms stay queryable until ingestion maps them.
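The wildcard case can be sketched against a toy catalog index. The index layout, the expand function, and the subject demo-001 are hypothetical; a real catalog index (the datasets.yml-style inventory) may be organized quite differently.

```python
# Hypothetical catalog index: canonical subject id -> available term tuples
# (modality, space, dtype, qualifier).
CATALOG_INDEX = {
    "hcp-100307": {("fmri", "mni152", "bold", "rest"), ("eeg", "native", "voltage", "rest")},
    "hcp-100408": {("fmri", "mni152", "bold", "rest")},
    "demo-001": {("fmri", "native", "bold", "rest")},
}


def expand(subjects: list, terms: tuple) -> list:
    """Stage 4 sketch: ground a subject wildcard or list against the index."""
    wanted = sorted(CATALOG_INDEX) if subjects == ["*"] else subjects
    candidates = []
    for subject in wanted:
        for entry in sorted(CATALOG_INDEX.get(subject, ())):
            # A "*" term matches any value at that position.
            if len(entry) == len(terms) and all(t in ("*", e) for t, e in zip(terms, entry)):
                path = "/".join(":" + e for e in entry)
                candidates.append(f"brain:///{subject}/{path}/@*")
    return candidates


for candidate in expand(["*"], ("fmri", "mni152", "bold", "rest")):
    print(candidate)
```

With a concrete subject list the same function returns at most one candidate per matching entry, which is the single-path case in the worked trace.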

Interactive plan visualizer

The flowchart above is fixed. The query plan visualizer is live: enter any brain:// address (or pick an example) and watch it parse into segments, then derive the stage-6 graph by searching the transform registry (shown on the page) instead of dispatching a fixed per-modality recipe. Toggling the derivatives cache shows the stage-5 cache hit, where an existing derivative lets the planner skip the upstream steps.

Unresolved terms (!)

Unknown or not-yet-mapped terms are prefixed with !. The ! sigil replaces the earlier ~ (which collided with the shell/OS home-directory sigil, making paths awkward to type and copy) and ? (reserved as the URI query delimiter).

Query: all paths containing unknown terms

dataset.query("brain:///*/!*")

Query: a specific unknown term across datasets

dataset.query("brain:///*/!weirdmodality")

Query: fully resolved paths only

dataset.query("brain:///*/:*/:*/:*/@*")
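How dataset.query matches these patterns is not specified here. As a rough sketch of the distinction the three queries draw, the helpers below classify already-expanded paths by their sigils. The helper names and the second example path are illustrative assumptions.

```python
def term_segments(path: str) -> list:
    """Return the vocabulary segments of a canonical path (those starting with : or !)."""
    return [s for s in path.split("/") if s[:1] in (":", "!")]


def has_unresolved(path: str) -> bool:
    """True if any segment is marked as an unresolved term."""
    return any(s.startswith("!") for s in term_segments(path))


def has_term(path: str, term: str) -> bool:
    """True if the path carries the given term, sigil included, ignoring case."""
    return term.lower() in (s.lower() for s in term_segments(path))


paths = [
    "brain:///hcp-100307/:fmri/:mni152/:bold/:rest/@*",
    "brain:///hcp-100307/!weirdmodality/:native/:voltage/@*",  # illustrative
]
print([p for p in paths if has_unresolved(p)])             # paths with unknown terms
print([p for p in paths if has_term(p, "!weirdmodality")]) # one specific unknown term
print([p for p in paths if not has_unresolved(p)])         # fully resolved only
```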

Validation

  1. Scheme and transport: a canonical address must parse as a URI with scheme brain or brain+{transport}. Bare brain:///... denotes the default local catalog; brain+{transport}://{catalog}/... denotes a named catalog over that transport.

  2. Raw locators: a raw source must be a valid URI in its native scheme (https:, s3:, file:). A bare path is normalized to file:///....

  3. Canonical vocabulary (:): segments prefixed with : must be members of (or resolvable to) the canonical vocabulary.

  4. Coordinates (@): coordinate selectors must be valid with respect to:

    • :modality (voxels, surface, streams)
    • :space (units, bounds)
    • :dtype (temporal indexing)

  5. Reserved characters: the path must not contain a literal ? or # (reserved as the URI query and fragment delimiters); unresolved terms use !.

  6. Use @* when requesting the entire object/stream.
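The purely syntactic rules can be checked without access to the vocabulary. The sketch below covers the scheme rule, the reserved-character rule, and segment sigils only; the validate function is hypothetical, and the vocabulary and coordinate checks are left out because they depend on data this document does not define.

```python
import re
from urllib.parse import urlsplit

SCHEME_RE = re.compile(r"^brain(\+[a-z0-9]+)?$")


def validate(address: str) -> list:
    """Return a list of rule violations; an empty list means the checks passed."""
    errors = []
    # Reserved characters: check before URI parsing splits off query/fragment.
    if "?" in address or "#" in address:
        errors.append("reserved character ? or # in path")
    parts = urlsplit(address)
    # Scheme and transport: brain or brain+{transport}.
    if not SCHEME_RE.match(parts.scheme):
        errors.append("scheme must be brain or brain+{transport}")
    elif "+" in parts.scheme and not parts.netloc:
        errors.append("brain+{transport} requires a named catalog")
    segments = [s for s in parts.path.split("/") if s]
    # Every segment after {subjects} must carry a :, ! or @ sigil.
    for seg in segments[1:]:
        if seg[0] not in ":!@":
            errors.append(f"segment {seg!r} lacks a :, ! or @ sigil")
    # The coordinate selector, if present, closes the path.
    if any(s.startswith("@") for s in segments[1:-1]):
        errors.append("@coords must be the last segment")
    return errors


print(validate("brain:///hcp-100307/:fmri/:mni152/:bold/:rest/@*"))  # []
print(validate("brain:///hcp-100307/fmri/:mni152/:bold/@*"))
```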

Summary

  • brain:///... is the canonical, location-independent address; an empty authority is the default local catalog, and brain+{transport}://{catalog}/... resolves it from a named catalog. The catalog (where) and the subject’s dataset prefix (provenance) are independent.
  • Raw bytes keep their native URI (https:, s3:, file:); the catalog maps canonical paths to raw sources, and .raw returns the native URI.
  • : marks canonical terms, ! marks unresolved terms, and @ provides coordinate-aware indexing with Media Fragments-style keys.
  • The query planner turns a brain:// address into raw sources plus a derivation plan through an explicit nine-stage pipeline; see Query planning.