TopK Language Models
Ryosuke Takahashi, Tatsuro Inaba, Kentaro Inui, Benjamin Heinzerling

TL;DR
This paper introduces TopK LMs, a transformer architecture modified with an integrated TopK activation function. The modification improves interpretability and stability without sacrificing performance, making it easier to study how language models learn and represent concepts.
Contribution
The paper proposes a novel transformer modification with TopK activation, enabling direct interpretability of hidden states and stable analysis across training checkpoints.
Findings
TopK LMs retain the capabilities of the original architecture.
They enable targeted neuron interventions.
They facilitate detailed analysis of neuron formation.
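As a rough illustration of what a targeted neuron intervention can look like, the sketch below overwrites one coordinate of a sparse hidden state, either zeroing it (ablation) or setting it to a larger value (amplification). This is not the paper's procedure: the function name intervene_on_neuron, the example vector, and the chosen indices and values are all hypothetical and exist only to make the idea concrete.

```python
from typing import List


def intervene_on_neuron(hidden: List[float], neuron: int, value: float) -> List[float]:
    """Return a copy of `hidden` with one neuron's activation set to `value`.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not 0 <= neuron < len(hidden):
        raise IndexError("neuron index out of range")
    patched = list(hidden)  # copy so the original state is left untouched
    patched[neuron] = value
    return patched


# Made-up sparse hidden state with two active neurons (indices 1 and 3).
sparse_state = [0.0, 2.5, 0.0, 1.7, 0.0, 0.0]

ablated = intervene_on_neuron(sparse_state, neuron=1, value=0.0)    # switch neuron 1 off
amplified = intervene_on_neuron(sparse_state, neuron=3, value=5.0)  # boost neuron 3

print(ablated)    # [0.0, 0.0, 0.0, 1.7, 0.0, 0.0]
print(amplified)  # [0.0, 2.5, 0.0, 5.0, 0.0, 0.0]
```

In a real model the patched state would be fed to the following layers so that the effect on the output can be observed; that step is omitted here.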
Abstract
Sparse autoencoders (SAEs) have become an important tool for analyzing and interpreting the activation space of transformer-based language models (LMs). However, SAEs suffer from several shortcomings that diminish their utility and internal validity. Since SAEs are trained post-hoc, it is unclear if the failure to discover a particular concept is a failure on the SAE's side or due to the underlying LM not representing this concept. This problem is exacerbated by training conditions and architecture choices affecting which features an SAE learns. When tracing how LMs learn concepts during training, the lack of feature stability also makes it difficult to compare SAE features across different checkpoints. To address these limitations, we introduce a modification to the transformer architecture that incorporates a TopK activation function at chosen layers, making the model's hidden states…
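The TopK activation mentioned in the abstract can be sketched in a few lines. The version below assumes the common definition: keep the k largest entries of a hidden-state vector and set all others to zero. Whether the paper selects by raw value or by magnitude, and how it handles ties, is not stated in the text above, so those choices, the function name topk_activation, and the example vector are assumptions made for illustration.

```python
from typing import List


def topk_activation(hidden: List[float], k: int) -> List[float]:
    """Keep the k largest entries of `hidden` and zero out the rest.

    Assumption: selection is by raw value, not by absolute value.
    """
    if not 0 < k <= len(hidden):
        raise ValueError("k must be between 1 and len(hidden)")
    # Indices of the k largest values; the sort is stable, so ties go to the lower index.
    top_indices = set(
        sorted(range(len(hidden)), key=lambda i: hidden[i], reverse=True)[:k]
    )
    return [v if i in top_indices else 0.0 for i, v in enumerate(hidden)]


# Made-up hidden state with six neurons; only the two largest survive.
hidden_state = [0.1, 2.5, -0.3, 1.7, 0.0, 0.9]
sparse_state = topk_activation(hidden_state, k=2)
print(sparse_state)  # [0.0, 2.5, 0.0, 1.7, 0.0, 0.0]
```

Because at most k coordinates are non-zero after this step, each surviving coordinate can be inspected directly as a candidate feature, without training a separate SAE afterwards.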
Taxonomy
Topics: Topic Modeling
