Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

William Lehn-Schiøler; Magnus Ruud Kjær; Rahul Thapa; Magnus Guldberg Pedersen; Anton Mosquera Storgaard; Nick Williams; Radu Gatej; Tue Lehn-Schiøler; Sándor Beniczky; Sadasivan Puthusserypady; James Zou; Lars Kai Hansen

arXiv:2605.13930·cs.LG·May 18, 2026

TL;DR

This paper uses sparse autoencoders to investigate the internal representations of EEG foundation models, assessing how interpretable and how entangled those representations are and relating them to physiology through spectral decoding.

Contribution

It introduces a framework that applies TopK Sparse Autoencoders to interpret EEG foundation models, benchmarks their representations, and analyzes their clinical and physiological implications.

Findings

01

Identified three operational regimes: steerable, entangled, and non-encoded concepts.

02

Exposed representational failures like 'wrecking-ball' interventions and confounding entanglements.

03

Mapped interventions to physiologically interpretable spectral signatures.

Abstract

EEG foundation models achieve state-of-the-art clinical performance, yet the internal computations driving their predictions remain opaque, which is a barrier to clinical trust. We apply TopK Sparse Autoencoders (SAEs) to three architecturally distinct EEG transformers (SleepFM, REVE, and LaBraM) to extract sparse feature dictionaries from their embeddings. By grounding these features in a clinical taxonomy (abnormality, age, sex, and medication), we benchmark monosemanticity and entanglement across architectures. A single hyperparameter procedure, driven by an intrinsic dictionary health audit, transfers robustly across all three architectures. Via concept steering, we introduce a "target vs. off-target" probe area metric to quantify steering selectivity and reveal three operational regimes: selectively steerable, encoded but entangled, and non-encoded. This framework exposes critical…
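
As an illustration of the two mechanisms the abstract names, the following is a minimal NumPy sketch of a generic TopK sparse autoencoder forward pass and of steering an embedding along one dictionary feature. It is not the authors' implementation: the function names, the weight shapes, the decoder-bias subtraction before encoding, and the ReLU applied to the kept activations are all assumptions made for the example, and the weights here are random rather than trained.

```python
import numpy as np


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Generic TopK SAE forward pass (illustrative, not the paper's code).

    x:     (batch, d_model) embeddings from a frozen model
    W_enc: (d_model, n_features), b_enc: (n_features,)
    W_dec: (n_features, d_model), b_dec: (d_model,)
    Returns sparse codes (batch, n_features) and reconstruction (batch, d_model).
    """
    pre = (x - b_dec) @ W_enc + b_enc
    # Column indices of the k largest pre-activations in each row.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    codes = np.zeros_like(pre)
    # Keep only the top-k entries; ReLU on them is an assumption of this sketch.
    codes[rows, idx] = np.maximum(pre[rows, idx], 0.0)
    recon = codes @ W_dec + b_dec
    return codes, recon


def steer(x, W_dec, feature, alpha):
    """Shift embeddings by alpha times one feature's decoder direction."""
    return x + alpha * W_dec[feature]


# Toy usage with random (untrained) weights; all sizes are hypothetical.
rng = np.random.default_rng(0)
d_model, n_features, k = 16, 64, 4
x = rng.normal(size=(8, d_model))
W_enc = rng.normal(size=(d_model, n_features)) * 0.1
W_dec = rng.normal(size=(n_features, d_model)) * 0.1
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

codes, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
x_steered = steer(x, W_dec, feature=3, alpha=2.0)
```

In a steering experiment of the kind described, one would feed `x_steered` to downstream probes and compare how the target concept's probe output changes against the off-target ones. How the paper turns that comparison into its probe area metric is not specified in the text above, so it is not reproduced here.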
