Brain-Semantoks: Learning Semantic Tokens of Brain Dynamics with a Self-Distilled Foundation Model
Sam Gijsen, Marc-Andre Schulz, Kerstin Ritter

TL;DR
Brain-Semantoks introduces a self-supervised framework that learns robust, high-level representations of brain dynamics from noisy fMRI data, improving downstream task performance and out-of-distribution generalization.
Contribution
The paper presents a novel semantic tokenizer and self-distillation training method tailored for brain dynamics, enabling more stable and meaningful representations from noisy fMRI signals.
Findings
Effective in downstream tasks with linear probes
Outperforms existing models in out-of-distribution scenarios
Scaling with more unlabeled data improves performance
Abstract
The development of foundation models for functional magnetic resonance imaging (fMRI) time series holds significant promise for predicting phenotypes related to disease and cognition. Current models, however, are often trained using a mask-and-reconstruct objective on small brain regions. This focus on low-level information leads to representations that are sensitive to noise and temporal fluctuations, necessitating extensive fine-tuning for downstream tasks. We introduce Brain-Semantoks, a self-supervised framework designed specifically to learn abstract representations of brain dynamics. Its architecture is built on two core innovations: a semantic tokenizer that aggregates noisy regional signals into robust tokens representing functional networks, and a self-distillation objective that enforces representational stability across time. We show that this objective is stabilized through…
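For intuition on the tokenizer idea described in the abstract, the following is a minimal sketch of collapsing regional time series into one series per functional network. It assumes plain mean pooling within each network; the paper's tokenizer is learned, and its details are not given in this summary. The function name, region names, and network labels are hypothetical.

```python
from collections import defaultdict


def network_tokens(signals, region_to_network):
    """Average regional time series within each functional network.

    signals: dict mapping region name -> list of floats (one per time point)
    region_to_network: dict mapping region name -> network label
    Returns a dict mapping network label -> mean time series.
    NOTE: mean pooling is an illustrative stand-in, not the paper's method.
    """
    grouped = defaultdict(list)
    for region, series in signals.items():
        grouped[region_to_network[region]].append(series)

    tokens = {}
    for network, members in grouped.items():
        # zip(*members) walks all member series one time point at a time.
        tokens[network] = [sum(vals) / len(vals) for vals in zip(*members)]
    return tokens


# Hypothetical toy data: 3 regions, 2 networks, 4 time points.
signals = {
    "r1": [1.0, 2.0, 3.0, 4.0],
    "r2": [3.0, 2.0, 1.0, 0.0],
    "r3": [0.5, 0.5, 0.5, 0.5],
}
region_to_network = {"r1": "visual", "r2": "visual", "r3": "default"}
tokens = network_tokens(signals, region_to_network)
```

Pooling many regions into a few network-level series shortens the token sequence and averages out region-level noise, which is the motivation the abstract gives for working at network granularity.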
Peer Reviews
Decision·ICLR 2026 Poster
- The model is evaluated on multiple datasets, including UK Biobank, ABIDE, HBN, SRPBS, and LEMON, and shows consistent improvements over strong baselines such as BrainLM and Brain-JEPA.
- The paper also includes extensive ablation studies on the effects of the semantic tokenizer design, temporal regularizer duration, masking type, loss components, and masking ratio.
- A main weakness is that all experiments rely solely on resting-state fMRI data, which limits the claim of being a true foundation model.
- Tables 1 and 2 need to have a statistical comparison between the best model and other models. It is commonly done with a pairwise test. Then, p-values are usually corrected for multiple comparisons.
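The multiple-comparison step this reviewer asks for could look like the sketch below. It assumes the Holm step-down procedure as the correction, which is one common choice; the review does not name a specific method. The raw p-values are hypothetical placeholders, not results from the paper.

```python
def holm_correction(p_values):
    """Return Holm-adjusted p-values in the same order as the input.

    The smallest raw p-value is multiplied by m, the next by m - 1, and so on.
    A running maximum keeps the adjusted values monotone, and each is capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values from pairwise tests of the best model
# against three baselines (illustrative numbers only).
raw = [0.01, 0.04, 0.03]
adjusted = holm_correction(raw)

# A comparison counts as significant at level alpha if its adjusted p < alpha.
alpha = 0.05
significant = [p < alpha for p in adjusted]
```

In practice the raw p-values would come from a paired test over matched folds or seeds for each model pair, and only the adjusted values would be reported next to the tables.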
• Innovative use of self-distillation and semantic tokenization to learn stable, abstract representations of brain activity.
• Clear performance improvements over existing foundation models (e.g., BrainLM, Brain-JEPA).
• Semantic tokenizer proves particularly effective for demographic and clinical predictions (age, sex, ASD).
• The shift from low-level voxel embeddings to network-based embeddings is conceptually strong.
• Significant ablation studies to explore the benefit of each component into
• While results are solid, gains from other architectural components beyond the tokenizer are more modest.
• The paper could discuss temporal resolution more thoroughly: would finer sampling (e.g., sub-2s TR) lead to better representations or unnecessary noise amplification?
• The work focuses exclusively on resting-state data; some commentary on potential extension to task-based fMRI would strengthen the contribution.
- Clear reframing toward semantic abstraction with a neuroscience-grounded tokenizer operating at functional network granularity, reducing token length and noise while injecting inductive bias.
- The slice masking to avoid trivial interpolation is a strong regularization that forces the model to learn meaningful relationships between tokens.
- Well-designed curriculum via TTR that averages network tokens over time early in training, improving stability of the model during training.
- Rigorous
- The atlas choice is mostly arbitrary. No analysis of how results change with alternative parcellations (Schaefer, Shen, Yeo-17) or different subcortical/cerebellar groupings; no exploration of data-driven network discovery to justify the choice of nine functional networks.
- The geometry of learned network identity embeddings is not analyzed; it is unclear whether they capture canonical inter-network relationships or known hierarchies.
- Precise kernel sequences and decay parameters are uncl
Taxonomy
Topics: Functional Brain Connectivity Studies · EEG and Brain-Computer Interfaces · Generative Adversarial Networks and Image Synthesis
