Cross-Layer Discrete Concept Discovery for Interpreting Language Models

Ankur Garg; Xuemin Yu; Hassan Sajjad; Samira Ebrahimi Kahou

arXiv:2506.20040 · cs.LG · July 18, 2025

TL;DR

This paper introduces CLVQ-VAE, a novel framework that uncovers and interprets emergent concepts across transformer layers by collapsing redundant features into meaningful vectors, improving understanding of language model internal representations.

Contribution

We propose CLVQ-VAE, a cross-layer vector quantization method that captures and interprets emergent concepts in language models, addressing limitations of single-layer analysis.

Findings

01. Effectively collapses redundant features into interpretable concept vectors

02. Uses top-k sampling and EMA updates for controlled discrete space exploration

03. Clusters representations by directional similarity, aligning with semantic structures
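
The findings above name three mechanisms: top-k sampling, EMA codebook updates, and clustering by directional similarity. The following is a minimal NumPy sketch of how these pieces could fit together, not the authors' implementation. It assumes cosine similarity as the measure of directional similarity and borrows the standard VQ-VAE exponential-moving-average codebook update. The function names (`topk_temperature_sample`, `ema_update`) and all hyperparameter values (`k`, `temperature`, `decay`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def topk_temperature_sample(h, codebook, k=5, temperature=1.0, rng=None):
    """Sample one codebook index per row of h from its k most similar codes.

    Assumption: similarity is cosine similarity (a stand-in for the
    "directional similarity" named in the findings).
    """
    rng = np.random.default_rng() if rng is None else rng
    sims = l2_normalize(h) @ l2_normalize(codebook).T        # (n, K)
    topk_idx = np.argsort(-sims, axis=1)[:, :k]              # (n, k)
    topk_sims = np.take_along_axis(sims, topk_idx, axis=1)   # (n, k)
    # Temperature-scaled softmax restricted to the top-k candidates.
    logits = topk_sims / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    choice = np.array([rng.choice(probs.shape[1], p=p) for p in probs])
    return topk_idx[np.arange(len(h)), choice]

def ema_update(codebook, ema_counts, ema_sums, h, assignments,
               decay=0.99, eps=1e-5):
    """Standard VQ-VAE style EMA codebook update (illustrative only)."""
    K = codebook.shape[0]
    one_hot = np.eye(K)[assignments]                          # (n, K)
    ema_counts = decay * ema_counts + (1 - decay) * one_hot.sum(axis=0)
    ema_sums = decay * ema_sums + (1 - decay) * (one_hot.T @ h)
    # Laplace smoothing keeps rarely used codes from dividing by ~0.
    n = ema_counts.sum()
    smoothed = (ema_counts + eps) / (n + K * eps) * n
    new_codebook = ema_sums / smoothed[:, None]
    return new_codebook, ema_counts, ema_sums

# Usage on random data (shapes are arbitrary).
rng = np.random.default_rng(0)
h = rng.normal(size=(8, 4))            # stand-in for layer representations
codebook = rng.normal(size=(16, 4))    # 16 hypothetical concept vectors
counts, sums = np.ones(16), codebook.copy()
idx = topk_temperature_sample(h, codebook, k=5, temperature=0.5, rng=rng)
codebook, counts, sums = ema_update(codebook, counts, sums, h, idx)
```

With `k=1` the sampler reduces to a deterministic nearest-code lookup under cosine similarity. Larger `k` or a higher `temperature` spreads assignments over more codes, which is one plausible reading of "controlled discrete space exploration".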

Abstract

Uncovering emergent concepts across transformer layers remains a significant challenge because the residual stream linearly mixes and duplicates information, obscuring how features evolve within large language models. Current research efforts primarily inspect neural representations at single layers, thereby overlooking this cross-layer superposition and the redundancy it introduces. These representations are typically either analyzed directly for activation patterns or passed to probing classifiers that map them to a limited set of predefined concepts. To address these limitations, we propose cross-layer VQ-VAE (CLVQ-VAE), a framework that uses vector quantization to map representations across layers and in the process collapse duplicated residual-stream features into compact, interpretable concept vectors. Our approach uniquely combines top-k temperature-based sampling during…

Taxonomy

Topics: Natural Language Processing Techniques · Rough Sets and Fuzzy Logic · Text and Document Classification Technologies

Methods: Sparse Evolutionary Training