Brain-Grounded Axes for Reading and Steering LLM States
Sandro Andric

TL;DR
This paper interprets and steers large language models along axes derived from human brain activity, giving interpretability tools an external, neurophysiological grounding.
Contribution
It proposes using human brain data as a coordinate system for reading and steering LLM states, creating stable, interpretable axes without fine-tuning the models.
Findings
Brain-derived axes track lexical properties such as word log-frequency and are validated against independent lexica and NER-based labels.
Steering along these axes affects LLM behavior consistently across models.
The approach offers a neurophysiologically grounded interpretability method.
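As a rough illustration of what "reading" and "steering" along a brain-derived axis could look like, here is a minimal sketch under stated assumptions. It assumes the lightweight adapter is a plain ridge regression from hidden states to axis scores and that steering adds a scaled unit vector along the adapter's weight column; the paper's actual adapter architecture and steering rule are not given in this summary, so the function names, the ridge penalty, and the synthetic data below are all hypothetical.

```python
import numpy as np


def fit_linear_adapter(hidden, axis_scores, ridge=1.0):
    """Ridge-regress axis scores (n_words, n_axes) on hidden states (n_words, d).

    Assumption: a linear adapter; the LLM itself is never updated.
    """
    mean_h = hidden.mean(axis=0)
    mean_y = axis_scores.mean(axis=0)
    hc = hidden - mean_h
    yc = axis_scores - mean_y
    d = hidden.shape[1]
    # Closed-form ridge solution: (H^T H + lambda I)^-1 H^T Y
    weights = np.linalg.solve(hc.T @ hc + ridge * np.eye(d), hc.T @ yc)
    bias = mean_y - mean_h @ weights
    return weights, bias


def read_axes(hidden, weights, bias):
    """Project hidden states onto the axes ("reading")."""
    return hidden @ weights + bias


def steer(hidden, weights, axis, alpha):
    """Shift hidden states by alpha along the unit direction of one axis."""
    direction = weights[:, axis]
    direction = direction / np.linalg.norm(direction)
    return hidden + alpha * direction


# Synthetic stand-ins for LLM hidden states and brain-axis scores.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 16))
axis_scores = hidden @ rng.normal(size=(16, 3)) + 0.1 * rng.normal(size=(200, 3))

weights, bias = fit_linear_adapter(hidden, axis_scores)
steered = steer(hidden[:5], weights, axis=0, alpha=2.0)
shift = read_axes(steered, weights, bias) - read_axes(hidden[:5], weights, bias)
print(shift[:, 0])  # read-out on axis 0 rises by alpha * ||w_0|| for every row
```

Because the adapter columns are generally not orthogonal, the read-outs on the other axes move as well in this sketch; a real experiment would need controls for that, such as the perplexity-matched controls the abstract mentions.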
Abstract
Interpretability methods for large language models (LLMs) typically derive directions from textual supervision, which can lack external grounding. We propose using human brain activity not as a training signal but as a coordinate system for reading and steering LLM states. Using the SMN4Lang MEG dataset, we construct a word-level brain atlas of phase-locking value (PLV) patterns and extract latent axes via ICA. We validate axes with independent lexica and NER-based labels (POS/log-frequency used as sanity checks), then train lightweight adapters that map LLM hidden states to these brain axes without fine-tuning the LLM. Steering along the resulting brain-derived directions yields a robust lexical (frequency-linked) axis in a mid TinyLlama layer, surviving perplexity-matched controls, and a brain-vs-text probe comparison shows larger log-frequency shifts (relative to the text probe) with…
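For readers unfamiliar with the phase-locking value, the sketch below shows the standard textbook definition: the magnitude of the mean unit phasor of the phase difference between two signals. It assumes instantaneous phases have already been extracted (for example with a band-pass filter and Hilbert transform); the frequency bands, sensor pairs, and time windows used to build the word-level atlas are not given here, and the function name is hypothetical.

```python
import cmath
import math


def phase_locking_value(phases_a, phases_b):
    """PLV = |mean(exp(i * (phase_a - phase_b)))|, a value in [0, 1].

    Inputs are equal-length sequences of instantaneous phases in radians.
    """
    if len(phases_a) != len(phases_b) or len(phases_a) == 0:
        raise ValueError("phase sequences must be non-empty and equal length")
    total = sum(cmath.exp(1j * (a - b)) for a, b in zip(phases_a, phases_b))
    return abs(total) / len(phases_a)


# A constant phase lag means perfect locking (PLV close to 1).
n = 100
phases_a = [0.1 * k for k in range(n)]
locked = phase_locking_value(phases_a, [p - 0.5 for p in phases_a])

# Phase differences spread evenly around the circle cancel out (PLV close to 0).
spread = phase_locking_value(
    [2 * math.pi * k / n for k in range(n)], [0.0] * n
)
print(locked, spread)
```

A word-level atlas in this spirit would store one such value per sensor pair per word, giving a feature vector that a decomposition such as ICA can then be applied to.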
Taxonomy
Topics: Neurobiology of Language and Bilingualism · Topic Modeling · Multimodal Machine Learning Applications
