Beyond Black Boxes: Enhancing Interpretability of Transformers Trained on Neural Data
Laurence Freeman, Philip Shamash, Vinam Arora, Caswell Barry, Tiago Branco, Eva Dyer

TL;DR
This paper integrates sparse autoencoders with transformer models decoding neural activity, improving interpretability by revealing variable-specific internal units without sacrificing decoding performance.
Contribution
The paper introduces a method that combines sparse autoencoders (SAEs) with transformers for neural decoding, making the internal representations of neuroscience models easier to interpret.
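To make the idea of a sparse autoencoder concrete, here is a minimal NumPy sketch of an SAE forward pass and its training objective, applied to a batch of hypothetical transformer activations. The ReLU encoder, the L1 sparsity penalty, the layer sizes, and all names (`sae_forward`, `sae_loss`, `l1_coeff`) are assumptions drawn from common SAE practice; the summary does not state the paper's actual architecture or loss.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative hidden units, then reconstruct them."""
    h = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU encoder (assumed design choice)
    x_hat = h @ W_dec + b_dec               # linear decoder back to activation space
    return h, x_hat

def sae_loss(x, h, x_hat, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse hidden units."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(h), axis=-1))
    return recon + l1_coeff * sparsity

# Hypothetical sizes: 16-dimensional activations, 64 SAE hidden units, batch of 8.
rng = np.random.default_rng(0)
d_model, d_hidden = 16, 64
x = rng.normal(size=(8, d_model))  # stand-in for transformer activations
W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_model))
b_dec = np.zeros(d_model)

h, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, h, x_hat)
```

In a setup like this, the reconstruction `x_hat` would be passed on to the rest of the transformer in place of `x`, which is one way an SAE can be added without changing what the downstream layers expect.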
Findings
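SAE hidden units respond selectively to interpretable features, such as stimulus orientation and genetic background.
Ablating the units associated with a given variable impairs decoding of that variable.
Adding the SAE leaves the transformer's decoding performance unchanged.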
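The ablation finding can be illustrated with a small sketch: zero out selected hidden units and compare the decoded output with and without them. The toy values, the linear readout `W_dec`, and the helper name `ablate_units` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

def ablate_units(h, unit_ids):
    """Return a copy of the hidden activations with the chosen units set to zero."""
    h_ablated = h.copy()
    h_ablated[:, unit_ids] = 0.0
    return h_ablated

# Toy hidden activations: 2 samples x 3 hidden units (hypothetical values).
h = np.array([[0.0, 2.0, 0.5],
              [1.0, 0.0, 3.0]])

# Toy linear readout from 3 hidden units to 2 outputs (hypothetical values).
W_dec = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])

full = h @ W_dec                        # output with every unit intact
ablated = ablate_units(h, [1]) @ W_dec  # output with unit 1 silenced
change = np.abs(full - ablated)         # where the ablation altered the output
```

In this toy case only the second output of the first sample changes, because unit 1 is active only for that sample and feeds only that output. Ablating units tied to one variable is expected to affect decoding of that variable in the same selective way.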
Abstract
Transformer models have become state-of-the-art in decoding stimuli and behavior from neural activity, significantly advancing neuroscience research. Yet greater transparency in their decision-making processes would substantially enhance their utility in scientific and clinical contexts. Sparse autoencoders offer a promising solution by producing hidden units that respond selectively to specific variables, enhancing interpretability. Here, we introduce SAEs into a neural decoding framework by augmenting a transformer trained to predict visual stimuli from calcium imaging in the mouse visual cortex. The enhancement of the transformer model with an SAE preserved its original performance while yielding hidden units that selectively responded to interpretable features, such as stimulus orientation and genetic background. Furthermore, ablating units associated with a given variable impaired…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Neural Networks and Applications
