FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models
Zhouliang Yu, Ruotian Peng, Keyi Ding, Yizhe Li, Zhongyuan Peng,, Minghao Liu, Yifan Zhang, Zheng Yuan, Huajian Xin, Wenhao Huang, Yandong Wen,, Ge Zhang, Weiyang Liu

TL;DR
FormalMATH introduces a large-scale, diverse benchmark for formal mathematical reasoning using Lean4, along with a novel autoformalization pipeline that reduces manual effort and reveals limitations of current LLM-based theorem provers.
Contribution
The paper presents FormalMATH, a comprehensive benchmark with an innovative human-in-the-loop autoformalization pipeline, and evaluates the performance and limitations of state-of-the-art LLM-based theorem provers.
Findings
State-of-the-art LLM provers achieve only 16.46% success rate.
Provers perform better in algebra than in calculus.
Chain-of-thought reasoning with natural language can decrease proof success.
Abstract
Formal mathematical reasoning remains a critical challenge for artificial intelligence, hindered by limitations of existing benchmarks in scope and scale. To address this, we present FormalMATH, a large-scale Lean4 benchmark comprising 5,560 formally verified problems spanning from high-school Olympiad challenges to undergraduate-level theorems across diverse domains (e.g., algebra, applied mathematics, calculus, number theory, and discrete mathematics). To mitigate the inefficiency of manual formalization, we introduce a novel human-in-the-loop autoformalization pipeline that integrates: (1) specialized large language models (LLMs) for statement autoformalization, (2) multi-LLM semantic verification, and (3) negation-based disproof filtering strategies using off-the-shelf LLM-based provers. This approach reduces expert annotation costs by retaining 72.09% of statements before manual…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques
