Quantifying LLM Attention-Head Stability: Implications for Circuit Universality
Karan Bali, Jack Stanley, Praneet Suresh, Danilo Bzdok

TL;DR
This paper measures how stable transformer attention heads are across independent training runs and reports how that stability relates to head robustness, importance, and regularization, all of which matter for safe and interpretable AI systems.
Contribution
It provides a systematic analysis of attention-head stability across multiple models and training instances, highlighting factors influencing stability and implications for interpretability.
Findings
Middle-layer heads are least stable but most distinct.
Deeper models show greater divergence in mid-depth layers.
Weight decay improves attention-head stability.
Abstract
In mechanistic interpretability, recent work scrutinizes transformer "circuits": sparse, mono- or multi-layer sub-computations that may reflect human-understandable functions. Yet these network circuits are rarely acid-tested for their stability across different instances of the same deep learning architecture. Without such testing, it remains unclear whether reported circuits emerge universally across labs or turn out to be idiosyncratic to a particular estimation instance, potentially limiting confidence in safety-critical settings. Here, we systematically study stability across refits in increasingly complex transformer language models of various sizes. We quantify, layer by layer, how similarly attention heads learn representations across independently initialized training runs. Our rigorous experiments show that (1) middle-layer heads are the least stable yet the most representationally…
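To make the idea of layer-by-layer head stability concrete, here is a minimal sketch of one way such a score could be computed. It is not the paper's metric, which this excerpt does not specify. It assumes each head is summarized by a vector (for example, its attention pattern on a fixed probe batch, flattened), matches every head in one run to its most similar head in the same layer of another run by cosine similarity, and averages the best-match scores. The names `cosine`, `layer_stability`, and `random_heads` are hypothetical, and the data is synthetic.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def layer_stability(heads_a, heads_b):
    """Assumed stability score for one layer: for each head in run A, take
    its best cosine match among the heads of run B, then average."""
    best = [max(cosine(h, g) for g in heads_b) for h in heads_a]
    return sum(best) / len(best)

def random_heads(n_heads, dim, rng):
    """Synthetic stand-in for per-head summary vectors (illustration only)."""
    return [[rng.random() for _ in range(dim)] for _ in range(n_heads)]

rng = random.Random(0)
run_a = random_heads(4, 16, rng)

# Run B reuses run A's heads in a different order: head indices differ
# across runs, but every head has an exact counterpart.
run_b_permuted = [run_a[2], run_a[0], run_a[3], run_a[1]]

# Run B with unrelated heads: no exact counterparts exist.
run_b_random = random_heads(4, 16, rng)

stable = layer_stability(run_a, run_b_permuted)
unstable = layer_stability(run_a, run_b_random)
print(f"permuted heads: {stable:.3f}, unrelated heads: {unstable:.3f}")
```

Best-match scoring is used here because head indices are arbitrary across independently initialized runs, so comparing head i in one run with head i in another would understate similarity. Repeating the computation per layer gives a stability profile over depth.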
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Machine Learning in Materials Science · Advanced Graph Neural Networks
