Sparse Attention Post-Training for Mechanistic Interpretability

Florent Draye; Anson Lei; Hsiao-Ru Pan; Ingmar Posner; Bernhard Schölkopf

arXiv:2512.05865 · cs.LG · March 6, 2026

Open Access

TL;DR

This paper presents a post-training method that makes transformer attention sparse, reducing connectivity to about 0.4% of its edges while retaining the original pretraining loss. The resulting internal connectivity is simpler and easier to interpret.

Contribution

The paper introduces a flexible sparsity regularization, applied under a constrained-loss objective, that removes most attention edges while retaining the original pretraining loss. The sparser attention yields simpler task-specific circuits and a more interpretable model.

Findings

01 Attention connectivity reduced to ≈0.4% of edges

02 Task-specific circuits involve up to 100x fewer connections

03 Sparse attention simplifies attribution and circuit analysis
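To make the "fraction of edges" figure in finding 01 concrete, here is a minimal sketch of how attention edge density could be measured. It assumes an edge is a causal (query, key) pair whose attention weight exceeds a small threshold; the function name `attention_edge_density`, the threshold value, and the toy inputs are illustrative assumptions, not the paper's definition or code.

```python
import numpy as np

def attention_edge_density(attn, threshold=1e-3):
    """Fraction of causal (query, key) pairs whose weight exceeds `threshold`.

    attn: array of shape (heads, seq, seq); each row is a distribution over keys.
    The threshold-based notion of an "edge" is an assumption for illustration.
    """
    attn = np.asarray(attn)
    heads, seq, _ = attn.shape
    # Causal mask: a query at position i may attend to keys at positions <= i.
    causal = np.tril(np.ones((seq, seq), dtype=bool))
    active = (attn > threshold) & causal
    return active.sum() / (heads * causal.sum())

# Toy input: one dense head (uniform over the causal window) and one
# maximally sparse head (each token attends only to itself).
seq = 4
dense = np.tril(np.ones((seq, seq)))
dense /= dense.sum(axis=1, keepdims=True)
diag = np.eye(seq)

density = attention_edge_density(np.stack([dense, diag]))
print(f"edge density: {density:.2f}")
```

In this toy case the dense head keeps all 10 causal pairs and the sparse head keeps 4, so 14 of 20 possible edges are active. A reported value of ≈0.4% would correspond to a far sparser pattern than either toy head.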

Abstract

We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 7B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.4\%$ of its edges. Unlike sparse-attention methods designed for computational efficiency, our approach leverages sparsity as a structural prior: it preserves capability while exposing a more organized and interpretable connectivity pattern. We find that this local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to 100x fewer edges connecting them. Additionally, using cross-layer transcoders, we show that sparse attention substantially simplifies attention…
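As a reading aid, one generic way to write a "sparsity regularisation under a constrained-loss objective" is shown below. This is a standard constrained-optimisation template, not the paper's stated formulation, and every symbol is an assumption: $\theta$ are the model parameters, $S(\theta)$ is some sparsity penalty on the attention pattern, $\mathcal{L}(\theta)$ is the pretraining loss, $\mathcal{L}_0$ is the loss of the original model, $\epsilon \ge 0$ is a tolerance, and $\lambda$ is a Lagrange multiplier.

```latex
% Constrained form: make attention as sparse as possible
% without letting the loss exceed the original model's loss.
\min_{\theta} \; S(\theta)
\quad \text{subject to} \quad
\mathcal{L}(\theta) \le \mathcal{L}_0 + \epsilon

% Equivalent Lagrangian (min-max) form with multiplier \lambda \ge 0.
\min_{\theta} \; \max_{\lambda \ge 0} \;
S(\theta) + \lambda \left( \mathcal{L}(\theta) - \mathcal{L}_0 - \epsilon \right)
```

Under this reading, the multiplier grows whenever the loss exceeds its budget and shrinks toward zero otherwise, so sparsity is only pursued as far as the loss constraint allows.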

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis · Machine Learning in Materials Science