Evaluating LLM Reasoning Beyond Correctness and CoT
Soheil Abbasloo

TL;DR
This paper introduces SIEV, a framework that evaluates language models' reasoning as a dynamic, interactive process rather than by answer correctness alone, and it reveals significant gaps in current models' reasoning abilities.
Contribution
The paper presents SIEV, a dialectics-inspired evaluation method that assesses reasoning through explicit thesis-antithesis-synthesis interactions, offering interpretability and process insights.
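To make the thesis-antithesis-synthesis idea concrete, here is a minimal sketch of what one such interaction could look like in code. It is an illustration only, not the paper's protocol: the names `Trajectory`, `run_dialectic`, `summarize`, and the three toy model functions are all hypothetical, and the exact-match comparison is an assumed stand-in for whatever scoring SIEV actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Trajectory:
    """One recorded interaction (hypothetical structure, not from the paper)."""
    thesis: str       # the model's initial answer
    antithesis: str   # a competing answer or challenge
    synthesis: str    # the answer after weighing both


def run_dialectic(
    question: str,
    propose: Callable[[str], str],
    challenge: Callable[[str, str], str],
    integrate: Callable[[str, str, str], str],
) -> Trajectory:
    """Run a single thesis -> antithesis -> synthesis round."""
    thesis = propose(question)
    antithesis = challenge(question, thesis)
    synthesis = integrate(question, thesis, antithesis)
    return Trajectory(thesis, antithesis, synthesis)


def summarize(traj: Trajectory, reference: str) -> Dict[str, bool]:
    """Compare the trajectory with a reference answer (assumed exact match)."""
    initially_correct = traj.thesis == reference
    finally_correct = traj.synthesis == reference
    return {
        "initially_correct": initially_correct,
        "finally_correct": finally_correct,
        "held_under_challenge": initially_correct and finally_correct,
    }


# Toy stand-ins for model calls, purely for illustration.
def toy_propose(question: str) -> str:
    return "4"


def toy_challenge(question: str, thesis: str) -> str:
    return "5"  # a deliberately wrong counter-position


def toy_integrate(question: str, thesis: str, antithesis: str) -> str:
    return antithesis  # this toy model simply gives in to the challenge


traj = run_dialectic("What is 2 + 2?", toy_propose, toy_challenge, toy_integrate)
report = summarize(traj, reference="4")
print(report)
```

In this toy run the standalone answer is correct, yet the final answer is not, which is the kind of difference a correctness-only metric on the first answer would miss.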
Findings
GPT-5-chat loses over 40 points on GSM when evaluated with SIEV.
SIEV exposes reasoning gaps in state-of-the-art models.
Process-oriented evaluation reveals differences not captured by correctness metrics.
Abstract
What does it truly mean for a language model to "reason"? Current evaluations reward models' correct standalone answers, but correctness alone reveals little about the process that produced them. We argue that reasoning should be understood not as a static chain of steps but as a dynamic trajectory in which ideas interact, clash, and evolve into integrated insights. Building on the philosophical tradition of dialectics, we introduce SIEV, a structured evaluation framework that assesses reasoning through explicit thesis-antithesis-synthesis interactions. SIEV produces interpretable trajectories that highlight key properties of reasoning (robustness to challenge, adaptability under conflict, and synthesis across competing viewpoints), dimensions that conventional correctness-based metrics cannot capture. Empirical results on GSM and MMLU demonstrate substantial gaps in the reasoning abilities…
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Explainable Artificial Intelligence (XAI)
