Automated Attention Pattern Discovery at Scale in Large Language Models
Jonathan Katzy, Razvan-Mihai Popescu, Erik Mekkes, Arie van Deursen, Maliheh Izadi

TL;DR
This paper introduces AP-MAE, a vision transformer-based model that reconstructs and analyzes attention patterns in large language models, demonstrating their usefulness for scalable interpretability and targeted interventions.
Contribution
It presents a novel vision transformer approach for reconstructing attention patterns, enabling scalable interpretability and transferability across models.
Findings
AP-MAE accurately reconstructs masked attention patterns.
It generalizes well to unseen models with minimal performance loss.
Attention patterns can predict generation correctness and guide interventions.
Abstract
Large language models have found success by scaling up capabilities to work in general settings. Unfortunately, the same cannot be said for interpretability methods. The current trend in mechanistic interpretability is to provide precise explanations of specific behaviors in controlled settings. These often do not generalize, or are too resource-intensive for larger studies. In this work, we propose to study repeated behaviors in large language models by mining completion scenarios in Java code datasets, exploiting the structured nature of code. We collect the attention patterns generated in the attention heads to demonstrate that they are scalable signals for global interpretability of model components. We show that vision models offer a promising direction for analyzing attention patterns at scale. To demonstrate this, we introduce the Attention Pattern - Masked…
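The masked-reconstruction idea behind treating an attention pattern as an image can be illustrated with a small sketch. This is not the authors' AP-MAE implementation: the patch size, the mask ratio, the toy causal attention map, and the helper names `patchify`, `random_mask`, and `masked_mse` are all assumptions made for illustration, and the vision transformer is replaced by a trivial mean-patch baseline so that the example stays self-contained.

```python
import numpy as np

def patchify(attn, patch):
    """Split a square (n, n) attention map into flattened (patch x patch) tiles."""
    n = attn.shape[0]
    assert attn.shape == (n, n) and n % patch == 0
    g = n // patch
    # (g, patch, g, patch) -> (g, g, patch, patch) -> (g*g, patch*patch)
    return attn.reshape(g, patch, g, patch).transpose(0, 2, 1, 3).reshape(g * g, patch * patch)

def random_mask(num_patches, mask_ratio, rng):
    """Boolean array; True marks patches hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

def masked_mse(pred, target, mask):
    """Mean squared error computed on the masked patches only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

rng = np.random.default_rng(0)
n, patch = 16, 4  # hypothetical sizes, not taken from the paper

# Toy causal attention pattern: row-wise softmax over lower-triangular scores.
scores = rng.normal(size=(n, n))
scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)

patches = patchify(attn, patch)
mask = random_mask(len(patches), 0.5, rng)  # 0.5 is an assumed mask ratio

# Stand-in "model": predict every patch as the mean of the visible patches.
baseline = np.broadcast_to(patches[~mask].mean(axis=0), patches.shape)
loss = masked_mse(baseline, patches, mask)
```

A learned model would replace the mean-patch baseline and be trained to lower `loss` on the hidden patches; the baseline only shows where such a model would plug in.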
