Mathematical Proof as a Litmus Test: Revealing Failure Modes of Advanced Large Reasoning Models

Dadi Guo; Jiayu Liu; Zhiyuan Fan; Zhitao He; Haoran Li; Yuxin Li; Yumeng Wang; Yi R. Fung

arXiv:2506.17114 · cs.AI · December 10, 2025

Open Access

TL;DR

This paper introduces a diagnostic dataset of mathematical proof problems for evaluating large reasoning models. It reveals significant shortcomings in proof correctness and reasoning rigor, along with hallucination, that traditional benchmarks mask.

Contribution

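The paper presents RFMDataset (Reveal Failure Modes), a benchmark of 200 diverse mathematical proof problems for diagnosing reasoning failures in large reasoning models. Analysis of model failures on it uncovers 10 fine-grained error types and points to fundamental limitations of current models.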

Findings

01 Some models produce entirely correct proofs for fewer than 20% of the problems.

02 Models offer no guarantee that individual reasoning steps are correct or rigorous.

03 Models exhibit hallucination and incompleteness during reasoning.

Abstract

Large reasoning models (e.g., R1, o3) have demonstrated remarkable mathematical problem-solving abilities. However, the high reported accuracy of these advanced models on popular datasets, combined with reliance on purely numerical evaluation and potential benchmark leakage, often masks their true reasoning shortcomings. To address this, we propose leveraging the inherent rigor and methodological complexity of mathematical proofs as a diagnostic tool to expose these hidden failures. Specifically, we introduce the RFMDataset (Reveal Failure Modes), a collection of 200 diverse mathematical proof problems, and thoroughly evaluate advanced models' performance on it. Our in-depth analysis of their failures uncovers 10 fine-grained error types, which reveal fundamental limitations in current large reasoning models: 1) large reasoning models struggle profoundly with mathematical proofs, with some generating…

Taxonomy

Topics: Constraint Satisfaction and Optimization · Bayesian Modeling and Causal Inference · Mathematics, Computing, and Information Processing