Latent Concept Disentanglement in Transformer-based Language Models

Guan Zhe Hong; Bhavya Vasudeva; Vatsal Sharan; Cyrus Rashtchian; Prabhakar Raghavan; Rina Panigrahy

arXiv:2506.16975·cs.LG·September 29, 2025

Open Access · 3 Reviews

TL;DR

This paper investigates how transformer-based language models represent and disentangle latent concepts during in-context learning, revealing their ability to identify and utilize latent structures in reasoning tasks.

Contribution

It demonstrates that transformers can successfully identify, disentangle, and utilize latent concepts in various reasoning tasks, advancing understanding of their internal representations.

Findings

01

Models identify latent concepts in reasoning tasks.

02

Low-dimensional subspaces reflect underlying parameters.

03

Transformers effectively use in-context learned concepts.

Abstract

When large language models (LLMs) use in-context learning (ICL) to solve a new task, they must infer latent concepts from demonstration examples. This raises the question of whether and how transformers represent latent structures as part of their computation. Our work experiments with several controlled tasks, studying this question using mechanistic interpretability. First, we show that in transitive reasoning tasks with a latent, discrete concept, the model successfully identifies the latent concept and does step-by-step concept composition. This builds upon prior work that analyzes single-step reasoning. Then, we consider tasks parameterized by a latent numerical concept. We discover low-dimensional subspaces in the model's representation space, where the geometry cleanly reflects the underlying parameterization. Overall, we show that small and large models can indeed disentangle…
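The claim about low-dimensional subspaces can be illustrated with a toy sketch. The code below is not the authors' pipeline and uses no model activations: it assumes a hypothetical latent angle `theta`, embeds points on a circle into a 64-dimensional space with small noise, and checks that plain PCA recovers a 2D subspace whose geometry mirrors the latent parameter. All names, dimensions, and noise levels are illustrative assumptions.

```python
import numpy as np


def make_synthetic_activations(n=200, dim=64, noise_scale=0.02, seed=0):
    """Build toy 'activations' driven by one hypothetical latent angle.

    Assumption: the latent numerical concept is an angle theta, and the
    representation encodes (cos theta, sin theta) in a random 2D plane.
    """
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    # Random orthonormal basis for a 2D plane inside the dim-dimensional space.
    basis, _ = np.linalg.qr(rng.normal(size=(dim, 2)))
    circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # shape (n, 2)
    signal = circle @ basis.T                                  # shape (n, dim)
    noise = noise_scale * rng.normal(size=(n, dim))
    return theta, signal + noise


def pca(X, k):
    """Project X onto its top-k principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)  # variance share per component
    return Xc @ Vt[:k].T, explained


theta, acts = make_synthetic_activations()
proj, explained = pca(acts, k=2)

# Share of variance captured by the first two components.
top2_share = float(explained[:2].sum())
# Distance of each projected point from the origin; near 1 if the circle survives.
radii = np.linalg.norm(proj, axis=1)
```

In this toy setting, a high `top2_share` and radii close to 1 indicate that the 2D projection preserves the circular parameterization. With real activations one would replace `acts` with hidden states collected at a chosen layer and token position.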

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The paper takes a systematic and well-executed approach to analyzing internal mechanisms of ICL.
- The approach offers interpretable, fine-grained insight into how contextual information propagates, complementing existing representation-based probing techniques.
- The study spans synthetic geometric reasoning tasks and natural language problems, demonstrating that the framework generalizes across distinct domains of conceptual structure.
- The visual analyses (mediation heatmaps and interven

Weaknesses

- The paper treats latent concepts as interpretable dimensions or attention patterns but never defines them formally.
- The causal model in the introduction ($F = R \circ C$) is not linked to the implemented CMA pipeline. $R$ and $C$ are just introduced without definitions. There is no derivation showing that the empirical mediation quantities estimated from activations correspond to components of this decomposition.
- Experiments are mostly conducted on synthetic or simplified reasoning dat

Reviewer 02 · Rating 6 · Confidence 3

Strengths

The exploration of how LLMs understand and reason at the conceptual level is interesting and provides valuable insights for the community. The use of causal mediation analysis and PCA visualization to validate the claims is persuasive under the synthetic tasks presented in the paper. The experimental design is rigorous and highly interpretable. In particular, the demonstration of low-dimensional manifold structures in the model’s representation space (i.e., the geometric interpretability of laten

Weaknesses

The tasks and experimental setups are overly idealized, relying almost entirely on highly synthetic toy tasks, which do not represent real-world natural language reasoning tasks. Of course, this is a common issue in the interpretability field. The analysis of model scale and generalization is insufficient. The paper only compares Gemma-2-27B and 2B, without systematically examining whether the same mechanisms hold across different architectures (e.g., LLaMA, Qwen series) or larger-scale models.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

Originality
* The unorthodox evaluation methodology of using activation patching with counterfeit examples is creative.
* The technique of using interpolation (referred to in the paper as steering) in this setting is innovative and establishes the geometry of the underlying latent variable.
* The paper discusses the findings of previous studies in the literature survey section and clearly states the novel contributions of this work.
Quality
* Overall, I find the paper to be of good quality, the meth

Weaknesses

* In the “memorized world knowledge” task, the authors claim that the LLM relies on step-by-step composition of latent concepts. However, all their experiments are of the kind where the latent concept has a one-to-one mapping with the output class (for instance country → capital). While there is no problem with this choice, it might be helpful to study different scenarios (such as country (bridge) → famous politicians from that country).
* In Figure 10, for the circular trajectory problem, even w

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Multimodal Machine Learning Applications