TL;DR
This paper introduces a novel approach to improve the accuracy and trustworthiness of chain-of-thought reasoning in language models by inferring the veracity of each reasoning step using a latent variable model and a search algorithm.
Contribution
It proposes Veracity Search and Amortized Veracity Inference methods to identify errors in reasoning chains, enabling zero-shot error detection and self-correction in language models.
Findings
VS reliably detects errors across multiple reasoning benchmarks.
AVI achieves comparable zero-shot accuracy in veracity inference.
Latent veracity inference aids in self-correction and self-improvement.
Abstract
Chain-of-Thought (CoT) reasoning has advanced the capabilities and transparency of language models (LMs); however, reasoning chains can contain inaccurate statements that reduce performance and trustworthiness. To address this, we propose to augment each reasoning step in a CoT with a latent veracity (or correctness) variable. To efficiently explore this expanded space, we introduce Veracity Search (VS), a discrete search algorithm over veracity assignments. It performs otherwise intractable inference in the posterior distribution over latent veracity values by leveraging the LM's joint likelihood over veracity and the final answer as a proxy reward. This efficient inference-time verification method facilitates supervised fine-tuning of an Amortized Veracity Inference (AVI) machine by providing pseudo-labels for veracity. AVI generalizes VS, enabling accurate zero-shot veracity…
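As a rough illustration of what a discrete search over veracity assignments can look like, here is a minimal sketch. It is not the authors' VS procedure, whose details the abstract does not give. It is a generic bit-flip local search with random restarts. The names `toy_log_joint` and `local_veracity_search`, and every probability in the scorer, are hypothetical. In a real setting the scorer would be replaced by a call that asks the LM for its joint log-likelihood of a veracity assignment and the final answer.

```python
import math
import random
from typing import Callable, Tuple

# 1 = step judged correct, 0 = step judged incorrect.
Assignment = Tuple[int, ...]


def toy_log_joint(v: Assignment) -> float:
    """Hypothetical stand-in for an LM's log-likelihood of (veracity, answer)."""
    # Made-up per-step probabilities that each of four steps is correct.
    p_correct = (0.9, 0.8, 0.2, 0.7)
    step_term = sum(
        math.log(p if bit else 1.0 - p) for p, bit in zip(p_correct, v)
    )
    # Made-up answer term: the observed final answer is assumed to be wrong,
    # so it is more likely when at least one step is marked incorrect.
    answer_term = math.log(0.9 if 0 in v else 0.1)
    return step_term + answer_term


def local_veracity_search(
    score: Callable[[Assignment], float],
    num_steps: int,
    num_restarts: int = 5,
    seed: int = 0,
) -> Tuple[Assignment, float]:
    """Bit-flip hill climbing over binary veracity assignments."""
    rng = random.Random(seed)
    best: Assignment = tuple([1] * num_steps)
    best_score = score(best)
    for _ in range(num_restarts):
        current = tuple(rng.randint(0, 1) for _ in range(num_steps))
        current_score = score(current)
        improved = True
        while improved:
            improved = False
            for i in range(num_steps):
                # Flip the veracity of step i and keep the flip if it helps.
                candidate = current[:i] + (1 - current[i],) + current[i + 1:]
                candidate_score = score(candidate)
                if candidate_score > current_score:
                    current, current_score = candidate, candidate_score
                    improved = True
        if current_score > best_score:
            best, best_score = current, current_score
    return best, best_score


if __name__ == "__main__":
    assignment, value = local_veracity_search(toy_log_joint, num_steps=4)
    print(assignment, round(value, 3))
```

On this toy scorer the search marks only the third step as incorrect, which matches exhaustive enumeration of all 16 assignments. Hill climbing is chosen here only because it is short; it can stall in local optima on less well-behaved scorers.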
Peer Reviews
Decision·ICLR 2026 Poster
(1) Originality: The methods proposed for improving the CoT are innovative. (2) Quality: The combination of VS and AVI is well-motivated, and experiments are carefully designed across logical, mathematical, and commonsense reasoning tasks. (3) Clarity: The paper provides clear definitions and comprehensive experiments. (4) Significance: Identifying and correcting reasoning errors is an important challenge for improving the reliability of LMs, and this work provides a promising direction.
(1) Many experiments rely on artificially corrupted reasoning chains. It would be valuable to see more extensive evaluation on naturally generated CoTs (this is the real-world use case). (2) The research on the impact of reasoning time is somewhat lacking. (3) AVI depends on VS, but VS itself may not be able to guarantee accuracy. This influence is not analyzed.
1. Strong performance gains. 2. Label-efficient step verification: the proposed Veracity Search uses the LLM's joint likelihood, avoiding expensive step-level supervision.
1. Most tests use artificially corrupted chains; evidence on naturally occurring errors is limited and needs broader experiments. 2. The joint-likelihood reward correlates with true veracity but not perfectly (Pearson 0.56–0.74), so misrankings can occur. 3. Computational efficiency can be improved: strong VS performance often uses tens to 100 samples, which adds computation cost compared with single-pass verifiers.
1. **Conceptual clarity and originality**: The paper disentangles reasoning content and correctness via a latent variable formulation. It provides a neat probabilistic framing of step-wise error identification in chains of thought. 2. **Veracity Search is an intuitive and effective inference-time algorithm**: It outperforms simple prompting-based verifiers across different benchmarks. 3. **Comprehensive evaluation.** The paper is well written, and the experiments are well-organized, and includes
1. **Reliance on prompting LMs for veracity scoring:** The entire framework assumes that the LM can reliably evaluate joint likelihoods of veracity assignments, yet prior studies (e.g., Huang et al., 2023; Zhang et al., 2024) show that LMs are often poor self-verifiers, especially on real-world reasoning where correctness is tricky/subtle. Some discussion or empirical evidence of robustness on naturally occurring reasoning errors would strengthen the claims. 2. **Lack of comparison to process r
