Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
William Lehn-Schiøler, Magnus Ruud Kjær, Rahul Thapa, Magnus Guldberg Pedersen, Anton Mosquera Storgaard, Nick Williams, Radu Gatej, Tue Lehn-Schiøler, Sándor Beniczky, Sadasivan Puthusserypady, James Zou, Lars Kai Hansen

TL;DR
This paper uses sparse autoencoders to investigate the internal representations of EEG foundation models, showing which clinical concepts are interpretable, which are entangled, and how interventions map to physiologically meaningful spectral changes.
Contribution
It introduces a framework that applies TopK Sparse Autoencoders to interpret EEG foundation models, benchmarks their representations, and analyzes their clinical and physiological implications.
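For readers unfamiliar with the technique, the following is a minimal sketch of a generic TopK sparse autoencoder forward pass, not the authors' implementation. The function name `topk_sae_forward`, the weight shapes, the tied initialization, and the random stand-in embeddings are all assumptions made for illustration; the summary does not specify the paper's dimensions, training loss, or hyperparameters.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x, keep the k largest pre-activations per row, then decode.

    Hypothetical helper for illustration; shapes are (batch, d_model) for x,
    (d_model, n_features) for W_enc, and (n_features, d_model) for W_dec.
    """
    pre = (x - b_dec) @ W_enc + b_enc               # (batch, n_features)
    # Column indices of the k largest pre-activations in each row.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    codes = np.zeros_like(pre)
    # ReLU on the surviving entries; every other feature stays at zero.
    codes[rows, idx] = np.maximum(pre[rows, idx], 0.0)
    recon = codes @ W_dec + b_dec                   # (batch, d_model)
    return codes, recon

# Random stand-ins for model embeddings (assumption: no real EEG data here).
rng = np.random.default_rng(0)
d_model, n_features, k = 16, 64, 4
x = rng.normal(size=(8, d_model))
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
W_dec = W_enc.T.copy()                              # tied init, a common choice
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

codes, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
```

Each row of `codes` has at most `k` non-zero entries, so every embedding is explained by a small set of dictionary features, which is what makes the features candidates for concept-level interpretation.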
Findings
Identified three operational regimes: steerable, entangled, and non-encoded concepts.
Exposed representational failures like 'wrecking-ball' interventions and confounding entanglements.
Mapped interventions to physiologically interpretable spectral signatures.
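The summary does not define how steering or the "target vs. off-target" probe area is computed, so the sketch below is only one plausible reading, offered to make the idea concrete. It assumes that steering adds a scaled unit direction to an embedding, that each concept has a linear logistic probe, and that "area" means the trapezoidal integral of the mean probe-probability shift over a sweep of steering strengths. The names `steer` and `probe_area`, the probes, and the data are all hypothetical.

```python
import numpy as np

def steer(z, direction, alpha):
    """Shift embeddings z along a unit-normalized direction by strength alpha."""
    unit = direction / np.linalg.norm(direction)
    return z + alpha * unit

def probe_area(probe_w, probe_b, z, direction, alphas):
    """Area under the absolute mean probe-probability shift across alphas.

    Assumed definition for illustration only; not taken from the paper.
    """
    def prob(e):
        return 1.0 / (1.0 + np.exp(-(e @ probe_w + probe_b)))
    base = prob(z).mean()
    shifts = np.array(
        [abs(prob(steer(z, direction, a)).mean() - base) for a in alphas]
    )
    # Trapezoidal rule written out explicitly.
    return float(np.sum((shifts[1:] + shifts[:-1]) / 2.0 * np.diff(alphas)))

# Toy setup: the target probe reads axis 0, the off-target probe reads axis 1.
rng = np.random.default_rng(0)
z = rng.normal(size=(100, 8))
target_w = np.eye(8)[0]
off_w = np.eye(8)[1]
direction = target_w.copy()          # steer exactly along the target axis
alphas = np.linspace(0.0, 5.0, 11)

target_area = probe_area(target_w, 0.0, z, direction, alphas)
off_area = probe_area(off_w, 0.0, z, direction, alphas)
```

Under this reading, a selectively steerable concept would show a large target area and a small off-target area, an entangled concept would move both, and a non-encoded concept would move neither. In the toy setup the steering direction is orthogonal to the off-target probe, so `off_area` is zero while `target_area` is positive.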
Abstract
EEG foundation models achieve state-of-the-art clinical performance, yet the internal computations driving their predictions remain opaque, a barrier to clinical trust. We apply TopK Sparse Autoencoders (SAEs) across three architecturally distinct EEG transformers (SleepFM, REVE, and LaBraM) to extract sparse feature dictionaries from their embeddings. By grounding these features in a clinical taxonomy (abnormality, age, sex, and medication), we benchmark monosemanticity and entanglement across architectures. A single hyperparameter procedure, driven by an intrinsic dictionary health audit, transfers robustly across all three architectures. Via concept steering, we introduce a "target vs. off-target" probe area metric to quantify steering selectivity and reveal three operational regimes: selectively steerable, encoded but entangled, and non-encoded. This framework exposes critical…
