Concept Component Analysis: A Principled Approach for Concept Extraction in LLMs
Yuhang Liu, Erdun Gao, Dong Gong, Anton van den Hengel, Javen Qinfeng Shi

TL;DR
This paper introduces Concept Component Analysis (ConCA), a theoretically grounded, unsupervised framework for extracting human-interpretable concepts from LLM representations, addressing limitations of previous methods like sparse autoencoders.
Contribution
The paper proposes ConCA, a novel principled approach based on linear unmixing and latent variable modeling for concept extraction in LLMs, with theoretical justification and practical variants.
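The linear-mixture idea behind "linear unmixing" can be illustrated with a toy sketch. It assumes, following the reviews' description, that a representation is approximately `A` times the stacked log posteriors of the latent concepts plus an offset `b`. The 2x2 matrix `A`, the offset `b`, the posterior values, and the helper names `mix` and `unmix_2x2` are all hypothetical and chosen for illustration. ConCA itself is described as unsupervised, so it would have to learn the unmixing rather than be given `A` and `b` as they are here.

```python
import math

def mix(A, z, b):
    """Linear mixture f = A z + b, with A given as a list of rows."""
    return [sum(a_ij * z_j for a_ij, z_j in zip(row, z)) + b_i
            for row, b_i in zip(A, b)]

def unmix_2x2(A, f, b):
    """Recover z from f = A z + b using the closed-form 2x2 inverse."""
    (a, c), (d, e) = A
    det = a * e - c * d
    if det == 0:
        raise ValueError("A must be invertible")
    r0, r1 = f[0] - b[0], f[1] - b[1]
    return [(e * r0 - c * r1) / det, (-d * r0 + a * r1) / det]

# Hypothetical posteriors of two latent concepts given one input x.
posteriors = [0.9, 0.2]
z = [math.log(p) for p in posteriors]   # stacked log posteriors

A = [[2.0, 1.0], [0.5, 3.0]]            # assumed known mixing matrix
b = [0.1, -0.3]                         # assumed known offset

f = mix(A, z, b)                        # toy "representation" of x
z_hat = unmix_2x2(A, f, b)              # unmixed log posteriors
recovered = [math.exp(v) for v in z_hat]
```

In this toy setting `recovered` matches `posteriors` up to floating-point error, because `A` is known and invertible. The hard part in practice, which this sketch does not attempt, is estimating the unmixing without access to `A`.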
Findings
ConCA effectively extracts meaningful concepts across multiple LLMs.
Sparse ConCA variants outperform traditional autoencoders in interpretability.
Theoretical analysis clarifies the relationship between LLM representations and human concepts.
Abstract
Developing human-understandable interpretations of large language models (LLMs) is becoming increasingly critical for their deployment in essential domains. Mechanistic interpretability seeks to address this by extracting human-interpretable processes and concepts from LLMs' activations. Sparse autoencoders (SAEs) have emerged as a popular approach for extracting interpretable and monosemantic concepts by decomposing the LLM internal representations into a dictionary. Despite their empirical progress, SAEs suffer from a fundamental theoretical ambiguity: the well-defined correspondence between LLM representations and human-interpretable concepts remains unclear. This lack of theoretical grounding gives rise to several methodological challenges, including difficulties in principled method design and evaluation criteria. In this work, we show that, under mild assumptions, LLM…
Peer Reviews
Decision · Submitted to ICLR 2026
I like the framing of this paper. It starts with a thoughtful feature-latent space definition, which allows arbitrary interactions between features, and then argues for why the linear process might be true in this case. This is a stronger case than assuming that concepts are linearly encoded a priori.
W1 I do not feel that this paper offers a significant amount, either in contributing to how we understand the way models work or in contributing a very useful tool that will be widely used in interpretability. I feel that ConCA is theoretically well motivated, but in practice I'm not sure what to take from it, or whether there is a reason to switch to it from other dictionary learning methods. W2 The empirical results do not have too many widely-applicable takeaways. It wou
I loved this paper. The writing is excellent and the story reads well. Congrats. Some positive points (P) that I will note here: P1. Clean theoretical through line. Figure 1 and Theorem 2.1 tie a latent variable model of text to the next token objective and yield the linear mixture $f(x)$ approximately equal to $A$ times stacked log posteriors plus $b$. This gives an implicit definition of a concept (latent factor organizing the data manifold), gives a crisp target for what a concept feature sho
OK, now for the weaknesses that I found in the paper; I'll group them into major (M) and minor (m). M1. Clarify what the mixing matrix implies and connect it explicitly to the Linear Representation Hypothesis (LRH). Equation 3 shows a linear mixture $f(x)$ equals $A$ times the stacked log posteriors plus $b$. This already implies that the encoding is linear in concepts up to an unknown mixing matrix, which is closely aligned with the linear representation hypothesis. I suggest stating this implic
- The theoretical contribution of concepts as linear combinations of log posteriors of latent concepts.
- It is great to see pushback on the field's assumptions about linearity/sparsity/SAE usage generally, and new modes of thought are exciting to see!
- The writing and figures are clean, with little to no grammatical issues.
- The paper claims that the exp function is too numerically unstable to use, which is fine, but there is very limited analysis or exploration of whether approximating exp with another function (e.g., SELU, ELU, SoftPlus) is a viable alternative that follows the theoretical motivations.
- Do we actually want to interpret only the things that are the underlying generative variables of the data? It seems to me like the models themselves don't learn the true causal variables and succumb to sp
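The numerical concern raised in this review can be made concrete with a small sketch using only the Python standard library. It shows that a raw `exp` overflows for large inputs while a softplus written in its stable form stays finite. The helper names `exp_overflows` and `softplus` are hypothetical, and nothing here reflects the paper's actual implementation. Softplus is positive and smooth like `exp`, but it is not the inverse of `log`, so whether it respects the theoretical motivation is exactly the open question the reviewer poses.

```python
import math

def exp_overflows(x):
    """Return True if math.exp(x) overflows a Python float."""
    try:
        math.exp(x)
        return False
    except OverflowError:
        return True

def softplus(x):
    """Stable softplus log(1 + exp(x)); never exponentiates a positive number."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

large_input = 1000.0
overflowed = exp_overflows(large_input)   # exp(1000) exceeds float range
stable_value = softplus(large_input)      # stays finite, close to the input
```

For large positive inputs softplus behaves like the identity, and for large negative inputs it approaches zero, so it avoids overflow at the cost of no longer matching `exp` in shape.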
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Multimodal Machine Learning Applications
