Catch Me If You Can: How Smaller Reasoning Models Pretend to Reason with Mathematical Fidelity

Subramanyam Sahoo; Vinija Jain; Saanidhya Vats; Siddharth Mohapatra; Rui Min; Aman Chadha; Divya Chaudhary

arXiv:2512.00552·cs.CL·December 2, 2025

TL;DR

This paper introduces a diagnostic framework to evaluate the true mathematical reasoning ability of language models, revealing that small models often rely on pattern matching rather than genuine logical computation, despite high answer accuracy.

Contribution

The paper presents a novel, model-agnostic diagnostic framework for assessing reasoning fidelity in language models, moving beyond traditional accuracy metrics.

Findings

01. Qwen3-0.6B achieves 70%+ answer accuracy but only 15% backward consistency.

02. Transitivity coverage is limited to 32.2%, indicating reasoning failures.

03. The model is brittle under perturbations, exposing superficial reasoning.

Abstract

Current evaluation of mathematical reasoning in language models relies primarily on answer accuracy, potentially masking fundamental failures in logical computation. We introduce a diagnostic framework that distinguishes genuine mathematical reasoning from superficial pattern matching through four complementary axes: forward-backward consistency, transitivity coverage, counterfactual sensitivity, and perturbation robustness. Through a case study applying this framework to Qwen3-0.6B on the MenatQA dataset, we reveal a striking disconnect between surface performance and reasoning fidelity. While the model achieves reasonable answer accuracy (70%+), it demonstrates poor backward consistency (15%), limited transitivity coverage (32.2%), and brittle sensitivity to perturbations. Our diagnostics expose reasoning failures invisible to traditional accuracy metrics, suggesting that this small…
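The abstract names the diagnostic axes but does not give their formulas here. As a rough illustration only, the sketch below shows one plausible way to score two of them: backward consistency and transitivity coverage. Every name in it (`ProbeResult`, `backward_consistency`, `transitivity_coverage`) and both metric definitions are assumptions made for this example, not the authors' implementation, and the toy data is invented for demonstration.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Hypothetical record for one question (not from the paper)."""
    forward_correct: bool   # model answered the original question correctly
    backward_correct: bool  # model recovered the premise when given the answer


def accuracy(results: list[ProbeResult]) -> float:
    """Plain answer accuracy on the forward direction."""
    return sum(r.forward_correct for r in results) / len(results)


def backward_consistency(results: list[ProbeResult]) -> float:
    """Assumed definition: among forward-correct items, the share whose
    backward probe also succeeds."""
    forward_ok = [r for r in results if r.forward_correct]
    if not forward_ok:
        return 0.0
    return sum(r.backward_correct for r in forward_ok) / len(forward_ok)


def transitivity_coverage(asserted: set[tuple[str, str]]) -> float:
    """Assumed definition: of all relations (a, c) implied by chaining two
    asserted relations (a, b) and (b, c), the share the model also asserts."""
    implied = {
        (a, d)
        for (a, b) in asserted
        for (c, d) in asserted
        if b == c and a != d
    }
    if not implied:
        return 1.0  # nothing to cover
    return len(implied & asserted) / len(implied)


# Toy data, invented for illustration: high accuracy, low backward consistency.
toy_results = [
    ProbeResult(True, True),
    ProbeResult(True, False),
    ProbeResult(True, False),
    ProbeResult(False, False),
]
toy_relations = {("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")}

print(f"accuracy: {accuracy(toy_results):.2f}")
print(f"backward consistency: {backward_consistency(toy_results):.2f}")
print(f"transitivity coverage: {transitivity_coverage(toy_relations):.2f}")
```

Under these assumed definitions, a model can score well on `accuracy` while scoring poorly on the other two, which is the kind of gap the abstract describes between surface performance and reasoning fidelity.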

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications