VeriCoT: Neuro-symbolic Chain-of-Thought Validation via Logical Consistency Checks
Yu Feng, Nathaniel Weir, Kaj Bostrom, Sam Bayless, Darion Cassel, Sapana Chaudhary, Benjamin Kiesl-Reiter, Huzefa Rangwala

TL;DR
VeriCoT introduces a neuro-symbolic approach that extracts formal logical arguments from chain-of-thought reasoning in large language models, enabling verification of logical validity and improving reasoning accuracy in high-stakes tasks.
Contribution
The paper presents VeriCoT, a novel method that formalizes chain-of-thought reasoning into first-order logic for automated verification and improved model performance.
Findings
VeriCoT effectively detects flawed reasoning in multiple datasets.
The verification signal enhances model accuracy through self-reflection and fine-tuning.
VeriCoT serves as a predictor of answer correctness.
Abstract
LLMs can perform multi-step reasoning through Chain-of-Thought (CoT), but they cannot reliably verify their own logic. Even when they reach correct answers, the underlying reasoning may be flawed, undermining trust in high-stakes scenarios. To mitigate this issue, we introduce VeriCoT, a neuro-symbolic method that extracts and verifies formal logical arguments from CoT reasoning. VeriCoT formalizes each CoT reasoning step into first-order logic and identifies premises that ground the argument in source context, commonsense knowledge, or prior reasoning steps. The symbolic representation enables automated solvers to verify logical validity while the NL premises allow humans and systems to identify ungrounded or fallacious reasoning steps. Experiments on the ProofWriter, LegalBench, and BioASQ datasets show VeriCoT effectively identifies flawed reasoning, and serves as a strong predictor…
Peer Reviews
Decision·ICLR 2026 Poster
This paper tries to address the problem of verifying the step-by-step logical integrity of CoT, not just the final answer. The mechanism for generating implicit premises to fill logical gaps is multi-step beyond simple auto formalization. Verification, when it passes, is a high-precision signal, filtering incorrect reasoning and improving reliability over the base model. The authors demonstrate the utility for model improvement, using the verification signal for self-reflection and as a high-qua
1. The system "verifies" an LLM-generated CoT against LLM-generated premises. This is an internal consistency check, not a sound grounding check. There is lots of literature about LLMs preferring their own outputs. A human evaluation or ground truth is required to measure the error rate of the autoformalizer and the premise generator to prove the system isn't just "consistently wrong." 2. The low pass rate on LegalBench-SARA is concerning. Qualitative error analysis is needed to explain this fa
The core innovation lies in combining LLM-based autoformalization with SMT solver verification using Z3, which provides formal guarantees about logical consistency within the formalized representation. This approach offers more rigor than purely LLM-based verification methods, while remaining more flexible than pure formal methods that require pre-specified logical frameworks. I see an analogy with the informal process of proof checking of a human-generated proof in informal and formal reasonin
A fundamental concern is the circular dependency created by using LLMs for both autoformalization and premise generation, which are then employed to verify LLM outputs. This architectural decision introduces potential for systematic biases where the verifier may share failure modes with the system being verified. While the paper acknowledges this limitation, it doesn't adequately explore the implications or propose mitigation strategies. The system essentially asks LLMs to judge their own reason
This paper provides an intuitive procedure to validate the reliability of CoT sequences. In particular, looking at the VeriCOT failure cases is a good way of uncovering common "fallacies" of model CoTs (e.g., contradictions, ungrounded reasoning). Overall, the paper makes a nice coherent, and well-scoped contribution to test the reliability of LLM's CoT steps.
From the brief description of the baseline methods being compared to, it sounds like VeriCOT isn't the first to incorporate FOL (or similar expressions that is conducive to do verification along the way) to validate and improve CoT. It would be helpful to explicitly spell out how VeriCOT differs from these other approaches. Should there be another baseline where the LLM-judge just verifies each natural language reasoning step (without the FOL counterparts) to isolate the effect of FOL steps?
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsExplainable Artificial Intelligence (XAI) · Logic, Reasoning, and Knowledge · Bayesian Modeling and Causal Inference
