Sparse Attention Post-Training for Mechanistic Interpretability
Florent Draye, Anson Lei, Hsiao-Ru Pan, Ingmar Posner, Bernhard Schölkopf

TL;DR
This paper presents a post-training method to induce extreme sparsity in transformer attention, maintaining performance while significantly simplifying the model's internal connectivity for better interpretability.
Contribution
It introduces a flexible sparsity regularization technique that drastically reduces attention edges without performance loss, enhancing model interpretability and circuit simplicity.
Findings
Attention connectivity reduced to 0.4% of edges
Task-specific circuits involve up to 100x fewer connections
Sparse attention simplifies attribution and circuit analysis
Abstract
We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 7B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to about 0.4% of its edges. Unlike sparse-attention methods designed for computational efficiency, our approach leverages sparsity as a structural prior: it preserves capability while exposing a more organized and interpretable connectivity pattern. We find that this local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to 100x fewer edges connecting them. Additionally, using cross-layer transcoders, we show that sparse attention substantially simplifies attention…
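To make the two ideas in the abstract concrete, here is a minimal sketch in plain Python. It shows (1) one way to measure attention connectivity as the fraction of causal edges whose weight exceeds a threshold, and (2) a generic Lagrangian relaxation of a "minimise sparsity penalty subject to loss not exceeding a target" objective. The abstract does not specify the regulariser, the edge threshold, or the optimisation scheme, so the function names, the threshold value, and the dual-ascent update are all illustrative assumptions rather than the authors' method.

```python
def causal_edge_fraction(attn, threshold=1e-3):
    """Fraction of causal edges (query i, key j <= i) with weight above threshold.

    attn is a square list of lists; attn[i][j] is the attention weight
    from query position i to key position j. The threshold is an assumed
    cutoff for calling an edge "active"; the source does not give one.
    """
    n = len(attn)
    total = n * (n + 1) // 2  # causal positions, diagonal included
    active = sum(
        1
        for i in range(n)
        for j in range(i + 1)
        if attn[i][j] > threshold
    )
    return active / total


def lagrangian(sparsity_penalty, task_loss, target_loss, lam):
    """Relaxed objective: sparsity penalty plus lam times the constraint violation.

    Assumed form of the constraint: task_loss <= target_loss.
    """
    return sparsity_penalty + lam * (task_loss - target_loss)


def update_multiplier(lam, task_loss, target_loss, step=0.1):
    """Dual ascent step: raise lam while the loss constraint is violated.

    The multiplier is clipped at zero so a satisfied constraint is never rewarded.
    """
    return max(0.0, lam + step * (task_loss - target_loss))


# Toy 4-token patterns (hypothetical data, not from the paper).
n = 4
# Dense: each query spreads its weight uniformly over all allowed keys.
dense = [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)] for i in range(n)]
# Sparse: each query attends only to itself.
sparse = [[1.0 if j == i else 0.0 for j in range(n)] for i in range(n)]

dense_fraction = causal_edge_fraction(dense)    # every causal edge is active
sparse_fraction = causal_edge_fraction(sparse)  # 4 of 10 causal edges are active
```

In this toy setting the dense pattern uses all 10 causal edges and the self-only pattern uses 4 of them. A training loop built on this sketch would alternate gradient steps on `lagrangian` with calls to `update_multiplier`, so that the sparsity pressure relaxes whenever the task loss rises above its target.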
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis · Machine Learning in Materials Science
