Mathematical Proof as a Litmus Test: Revealing Failure Modes of Advanced Large Reasoning Models
Dadi Guo, Jiayu Liu, Zhiyuan Fan, Zhitao He, Haoran Li, Yuxin Li, Yumeng Wang, Yi R. Fung

TL;DR
This paper introduces a diagnostic dataset of mathematical proof problems for evaluating large reasoning models. It reveals significant shortcomings in proof correctness and reasoning rigor, along with hallucination, that traditional benchmarks mask.
Contribution
The paper presents RFMDataset, a new benchmark for diagnosing reasoning failures in large models through mathematical proofs, uncovering diverse error types and fundamental limitations.
Findings
Some models produce fully correct proofs for fewer than 20% of problems.
Models provide no guarantee that their reasoning is correct or rigorous.
Models exhibit hallucination and incompleteness.
Abstract
Large reasoning models (e.g., R1, o3) have demonstrated remarkable mathematical problem-solving abilities. However, the high reported accuracy of these advanced models on popular datasets, combined with a reliance on purely numerical evaluation and potential benchmark leakage, often masks their true reasoning shortcomings. To address this, we propose leveraging the inherent rigor and methodological complexity of mathematical proofs as a diagnostic tool to expose these hidden failures. Specifically, we introduce the RFMDataset (Reveal Failure Modes), a collection of 200 diverse mathematical proof problems, and thoroughly evaluate advanced models' performance on it. Our in-depth analysis of their failures uncovers 10 fine-grained error types, which reveal fundamental limitations in current large reasoning models: 1) large reasoning models struggle profoundly with mathematical proofs, with some generating…
Taxonomy
Topics: Constraint Satisfaction and Optimization · Bayesian Modeling and Causal Inference · Mathematics, Computing, and Information Processing
