Cross-Layer Discrete Concept Discovery for Interpreting Language Models
Ankur Garg, Xuemin Yu, Hassan Sajjad, Samira Ebrahimi Kahou

TL;DR
This paper introduces CLVQ-VAE, a framework that uncovers and interprets emergent concepts across transformer layers by collapsing redundant residual-stream features into compact concept vectors, with the aim of making language model internal representations easier to interpret.
Contribution
We propose CLVQ-VAE, a cross-layer vector quantization method that captures and interprets emergent concepts in language models, addressing limitations of single-layer analysis.
Findings
Effectively collapses redundant features into interpretable concept vectors
Uses top-k sampling and EMA updates for controlled discrete space exploration
Clusters representations by directional similarity, aligning with semantic structures
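The three findings above name generic mechanisms: similarity by direction, top-k sampling over a discrete codebook, and EMA codebook updates. The sketch below illustrates those mechanisms in isolation and is not the authors' implementation. The function names (cosine, sample_code, ema_update), the toy 2-D codebook, and the values of k, temperature, and decay are all assumptions chosen for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity: compares direction and ignores vector magnitude."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def sample_code(x, codebook, k=2, temperature=1.0, rng=random):
    """Sample a codebook index for x from its k most similar codes (hypothetical helper)."""
    sims = [cosine(x, c) for c in codebook]
    top = sorted(range(len(codebook)), key=lambda i: sims[i], reverse=True)[:k]
    # Softmax over the top-k similarities; a lower temperature makes the choice greedier.
    logits = [sims[i] / temperature for i in top]
    peak = max(logits)
    weights = [math.exp(value - peak) for value in logits]
    return rng.choices(top, weights=weights, k=1)[0]


def ema_update(codebook, assignments, inputs, decay=0.99):
    """Move each selected code toward the mean of its assigned inputs (in place)."""
    for idx in set(assignments):
        members = [x for x, a in zip(inputs, assignments) if a == idx]
        mean = [sum(col) / len(members) for col in zip(*members)]
        codebook[idx] = [decay * c + (1 - decay) * m
                         for c, m in zip(codebook[idx], mean)]


# Toy usage: three 2-D codes and three inputs (illustrative values only).
rng = random.Random(0)
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
inputs = [[0.9, 0.1], [0.2, 0.8], [2.0, 0.1]]
# With k=1 the sampling is deterministic: each input takes its nearest code.
assignments = [sample_code(x, codebook, k=1, rng=rng) for x in inputs]
ema_update(codebook, assignments, inputs, decay=0.9)
print(assignments)  # [0, 1, 0]
```

In this toy run, the inputs [0.9, 0.1] and [2.0, 0.1] land on the same code even though their lengths differ, because cosine similarity looks only at direction. Setting k above 1 lets an input occasionally pick a slightly less similar code, which is one way sampling can spread usage across a codebook.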
Abstract
Uncovering emergent concepts across transformer layers remains a significant challenge because the residual stream linearly mixes and duplicates information, obscuring how features evolve within large language models. Current research efforts primarily inspect neural representations at single layers, thereby overlooking this cross-layer superposition and the redundancy it introduces. These representations are typically either analyzed directly for activation patterns or passed to probing classifiers that map them to a limited set of predefined concepts. To address these limitations, we propose cross-layer VQ-VAE (CLVQ-VAE), a framework that uses vector quantization to map representations across layers and in the process collapse duplicated residual-stream features into compact, interpretable concept vectors. Our approach uniquely combines top-k temperature-based sampling during…
Taxonomy
Topics: Natural Language Processing Techniques · Rough Sets and Fuzzy Logic · Text and Document Classification Technologies
Methods: Sparse Evolutionary Training
