Latent Debate: A Surrogate Framework for Interpreting LLM Thinking
Lihu Chen, Xiang Yin, Francesca Toni

TL;DR
This paper introduces latent debate, a framework that interprets LLM internal reasoning by capturing hidden supporting and attacking signals within a single model, aiding in understanding hallucinations and internal mechanisms.
Contribution
The paper proposes a novel, model- and task-agnostic latent debate framework that interprets LLM thinking and detects hallucinations through internal argument signals.
Findings
Latent debate aligns closely with original LLM predictions.
High latent debate activity correlates with increased hallucination risk.
The framework offers a new approach to understanding LLM internal processes.
Abstract
Understanding the internal thinking process of Large Language Models (LLMs) and the cause of hallucinations remains a key challenge. To this end, we introduce latent debate, a novel framework for interpreting model predictions through the lens of implicit internal arguments. Unlike the current work of self-consistency and multi-agent debate, which relies on explicit debates among multiple answers or multiple models, latent debate captures the hidden supporting and attacking signals that arise within a single model during a single inference. We first present a model- and task-agnostic conceptual framework, and then instantiate it symbolically to approximate the thinking process of LLMs on True/False prediction tasks. Empirical studies demonstrate that latent debate is a faithful structured surrogate model that has highly consistent predictions with the original LLM. Beyond…
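To make the idea of supporting and attacking signals concrete, here is a minimal toy sketch. It is not the paper's instantiation: the signed scores, the `aggregate` function, and the sum-based aggregation rule are all assumptions chosen for illustration. Each latent argument carries a signed score (positive supports "True", negative supports "False"), and the sketch returns a verdict plus a simple disagreement measure.

```python
def aggregate(scores):
    """Toy aggregation of signed latent-argument scores (hypothetical rule).

    scores: iterable of floats; > 0 supports "True", < 0 supports "False".
    Returns (verdict, disagreement), where disagreement is the share of
    total argument strength held by the weaker side, in [0, 0.5].
    """
    scores = list(scores)
    if not scores:
        raise ValueError("need at least one argument score")

    support = sum(s for s in scores if s > 0)   # total strength for "True"
    attack = sum(-s for s in scores if s < 0)   # total strength for "False"
    total = support + attack

    verdict = support >= attack                 # ties resolve to "True" (assumption)
    disagreement = 0.0 if total == 0 else min(support, attack) / total
    return verdict, disagreement


if __name__ == "__main__":
    print(aggregate([0.9, 0.8, 0.7]))   # unanimous support
    print(aggregate([0.6, -0.5, 0.1]))  # contested case
```

Under this toy rule, a contested case yields a disagreement value near 0.5, which is the kind of quantity the findings above associate with higher hallucination risk; the paper's actual measure may differ.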
Peer Reviews
Decision·Submitted to ICLR 2026
The paper is clearly written, results are presented clearly, and it has helpful diagrams. The fact that the amount of disagreement between true/false directionality within a model is predictive of hallucination is an interesting finding.
The results require better baselines: (1) using only the rightmost hidden state from layer L-1, since that is the 'closest' to where the actual prediction happens. (2) compute the consistency score for all arguments, and present the max. Without that, I don't think the benefit is properly established: to be beneficial, the outcome of the QBAF procedure should be more predictive than the variables that go into it. The same goes for the hallucination detection, where the results are not compar
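The second baseline requested above (score every individual argument against the LLM's prediction and report the best one) could be sketched roughly as follows. The function names, the dictionary layout, and the example argument names are hypothetical; the paper's exact consistency metric is not given in this excerpt, so plain agreement rate is assumed.

```python
def consistency(preds, llm_preds):
    """Fraction of examples on which `preds` agree with the LLM's predictions."""
    if len(preds) != len(llm_preds) or not llm_preds:
        raise ValueError("prediction lists must be non-empty and equal length")
    return sum(p == q for p, q in zip(preds, llm_preds)) / len(llm_preds)


def best_single_argument(argument_preds, llm_preds):
    """Return (name, score) of the single argument most consistent with the LLM.

    argument_preds: dict mapping a hypothetical argument name to its list of
    True/False predictions, one per example.
    """
    scored = {name: consistency(p, llm_preds) for name, p in argument_preds.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


if __name__ == "__main__":
    llm = [True, False, True, True]
    args = {
        "last_token_late_layer": [True, False, True, False],
        "last_token_mid_layer": [False, False, True, False],
    }
    print(best_single_argument(args, llm))
```

Comparing this maximum against the consistency of the aggregated surrogate is one way to check whether the aggregation adds predictive value beyond its inputs.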
- The conceptual framework is intuitive and well-motivated by argumentation theory.
- While the logit lens technique used in instantiation is not novel, the experiments provide solid support for the argumentation-based approach.
- For the instantiation, it is unclear why all tokens are treated as thinking steps (Line 288). In LLMs, only the final token attends to all previous ones. If earlier token representations lack full contextual information, it is questionable how they can form meaningful arguments about sentence correctness.
- The writing lacks clarity. More descriptive figure captions would improve readability, rather than relying solely on definitions in the main text. The authors should also clearly define key
1. The overall idea sounds interesting.
2. The framework is training-free and has no intensive computational burden.
3. The authors incorporated ablation studies and some hallucination analysis on their approach.
1. **Evaluation looks toyish; baselines are too weak; no comparison to close related methods**
   * Datasets are toy datasets (500 each) rather than large-scale, realistic NLP benchmarks; the setting feels **too toy** to support their claims.
   * Baselines are **self-constructed and too simple**. Although the authors discussed DoLa / Logit Lens / Internal Consistency approaches briefly in text, which is good, they did not compare their method to these baselines.
2. **Writing quality is poor;
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Misinformation and Its Impacts
