FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language   Models

Zhouliang Yu; Ruotian Peng; Keyi Ding; Yizhe Li; Zhongyuan Peng,; Minghao Liu; Yifan Zhang; Zheng Yuan; Huajian Xin; Wenhao Huang; Yandong Wen,; Ge Zhang; Weiyang Liu

arXiv:2505.02735·cs.AI·May 6, 2025

FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models

Zhouliang Yu, Ruotian Peng, Keyi Ding, Yizhe Li, Zhongyuan Peng,, Minghao Liu, Yifan Zhang, Zheng Yuan, Huajian Xin, Wenhao Huang, Yandong Wen,, Ge Zhang, Weiyang Liu

PDF

Open Access 1 Repo 2 Datasets

TL;DR

FormalMATH introduces a large-scale, diverse benchmark for formal mathematical reasoning using Lean4, along with a novel autoformalization pipeline that reduces manual effort and reveals limitations of current LLM-based theorem provers.

Contribution

The paper presents FormalMATH, a comprehensive benchmark with an innovative human-in-the-loop autoformalization pipeline, and evaluates the performance and limitations of state-of-the-art LLM-based theorem provers.

Findings

01

State-of-the-art LLM provers achieve only 16.46% success rate.

02

Provers perform better in algebra than in calculus.

03

Chain-of-thought reasoning with natural language can decrease proof success.

Abstract

Formal mathematical reasoning remains a critical challenge for artificial intelligence, hindered by limitations of existing benchmarks in scope and scale. To address this, we present FormalMATH, a large-scale Lean4 benchmark comprising 5,560 formally verified problems spanning from high-school Olympiad challenges to undergraduate-level theorems across diverse domains (e.g., algebra, applied mathematics, calculus, number theory, and discrete mathematics). To mitigate the inefficiency of manual formalization, we introduce a novel human-in-the-loop autoformalization pipeline that integrates: (1) specialized large language models (LLMs) for statement autoformalization, (2) multi-LLM semantic verification, and (3) negation-based disproof filtering strategies using off-the-shelf LLM-based provers. This approach reduces expert annotation costs by retaining 72.09% of statements before manual…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

sphere-ai-lab/formalmath-bench
pytorchOfficial

Datasets

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsNatural Language Processing Techniques