Correct Chains, Wrong Answers: Dissociating Reasoning from Output in LLM Logic
Abinav Rao, Sujan Rachuri, Nikhil Vemuri

TL;DR
This paper introduces the Novel Operator Test, a benchmark to distinguish genuine reasoning from pattern retrieval in LLMs, revealing reasoning-output dissociations and specific failure modes at various depths.
Contribution
The paper presents a new benchmark that isolates reasoning from pattern recognition, enabling detailed analysis of LLM reasoning failures and their causes.
Findings
Models can execute every reasoning step correctly and still declare a wrong final answer.
Scaffolding sharply reduces strategy failures at depth 2 (+62pp).
Models struggle with reasoning on novel logic operators, especially at greater depths.
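To make the first finding concrete, here is a minimal sketch of how an evaluated response might be labeled as a reasoning-output dissociation. The `Transcript` structure, the `classify` function, and the label names are hypothetical illustrations and are not taken from the paper; the sketch assumes each intermediate value the model wrote has already been parsed into a Boolean.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transcript:
    """Hypothetical parsed model response (not the paper's format)."""
    step_values: List[bool]  # value the model wrote at each reasoning step
    declared_answer: bool    # final answer the model declared


def classify(transcript: Transcript, expected_steps: List[bool]) -> str:
    """Label a response by comparing it with the ground-truth chain.

    expected_steps holds the correct intermediate values; its last entry
    is the correct final answer. Only the declared answer decides whether
    a response counts as correct.
    """
    truth = expected_steps[-1]
    if transcript.declared_answer == truth:
        return "correct"
    if transcript.step_values == expected_steps:
        # Every step is right, yet the declared answer is wrong.
        return "dissociation"
    return "reasoning_error"
```

Under these assumptions, a response whose steps all match the ground truth but whose declared answer differs is labeled `dissociation`, which is the pattern that answer-only scoring cannot distinguish from an ordinary reasoning mistake.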
Abstract
LLMs can execute every step of chain-of-thought reasoning correctly and still produce wrong final answers. We introduce the Novel Operator Test, a benchmark that separates operator logic from operator name, enabling rigorous distinction between genuine reasoning and pattern retrieval. By evaluating Boolean operators under unfamiliar names across depths 1-10 on five models (up to 8,100 problems each), we demonstrate a reasoning-output dissociation that existing benchmarks cannot detect. At Claude Sonnet 4's depth 7, all 31 errors have verifiably correct reasoning yet wrong declared answers; 17/19 errors in mixed-operator chains exhibit the same pattern. The benchmark reveals two failure types: strategy failures at depth 2, where models attempt terse retrieval (+62pp from scaffolding), and content failures at depth 7, where models reason fully but err systematically (+8-30pp, 0/300 errors…
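The idea of separating operator logic from operator name can be sketched as follows. This is an illustrative sketch under stated assumptions, not the benchmark's actual generator: the made-up names (`zorp`, `blick`, `quax`), the choice of truth tables, and the reading of "depth" as the number of nested operator applications are all assumptions.

```python
import random
from typing import List, Tuple

# Hypothetical unfamiliar names bound to familiar Boolean truth tables.
OPERATORS = {
    "zorp": lambda a, b: a and b,   # behaves like AND
    "blick": lambda a, b: a or b,   # behaves like OR
    "quax": lambda a, b: a != b,    # behaves like XOR
}


def make_problem(depth: int, rng: random.Random) -> Tuple[str, List[bool]]:
    """Build a nested expression and its ground-truth intermediate values.

    Assumes depth means the number of nested operator applications.
    Returns the expression text and the value after each application.
    """
    value = rng.choice([True, False])
    expr = str(value)
    steps: List[bool] = []
    for _ in range(depth):
        name = rng.choice(sorted(OPERATORS))
        operand = rng.choice([True, False])
        value = OPERATORS[name](value, operand)
        expr = f"{name}({expr}, {operand})"
        steps.append(value)
    return expr, steps


if __name__ == "__main__":
    expression, chain = make_problem(depth=3, rng=random.Random(0))
    print(expression)
    print(chain)
```

Because the truth tables stay fixed while the names are unfamiliar, a model cannot answer by recalling how `AND` or `XOR` usually behaves; it has to apply the definitions it was given. The recorded intermediate values allow each step of a response to be checked separately from the declared final answer.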
