FATE: A Formal Benchmark Series for Frontier Algebra of Multiple Difficulty Levels

Jiedong Jiang; Wanyi He; Yuefeng Wang; Guoxiong Gao; Yongle Hu; Jingting Wang; Nailin Guan; Peihao Wu; Chunbo Dai; Liang Xiao; Bin Dong

arXiv:2511.02872·cs.LG·March 10, 2026

FATE: A Formal Benchmark Series for Frontier Algebra of Multiple Difficulty Levels

Jiedong Jiang, Wanyi He, Yuefeng Wang, Guoxiong Gao, Yongle Hu, Jingting Wang, Nailin Guan, Peihao Wu, Chunbo Dai, Liang Xiao, Bin Dong

PDF

Open Access 3 Reviews

TL;DR

FATE introduces a series of formal algebra benchmarks with varying difficulty levels to evaluate and advance the capabilities of language models in complex mathematical reasoning and formalization.

Contribution

The paper presents FATE, a novel benchmark series in formal algebra with problems surpassing PhD exam difficulty, and evaluates state-of-the-art models revealing significant performance gaps.

Findings

01

Models achieve only 3% accuracy on FATE-H and 0% on FATE-X.

02

Natural-language reasoning is more accurate than formalization.

03

Specialized provers may perform worse than general-purpose models.

Abstract

Recent advances in large language models (LLMs) have demonstrated impressive capabilities in formal theorem proving, particularly on contest-based mathematical benchmarks like the IMO. However, these contests do not reflect the depth, breadth, and abstraction of modern mathematical research. To bridge this gap, we introduce FATE (Formal Algebra Theorem Evaluation), a new benchmark series in formal algebra designed to chart a course toward advanced mathematical reasoning. We present two new components, FATE-H and FATE-X, each with 100 problems in abstract and commutative algebra. The FATE series spans a difficulty spectrum from undergraduate exercises to problems exceeding PhD qualifying exams. Notably, FATE-X is the first formal benchmark to surpass both PhD-level exam difficulty and the coverage of the Mathlib library. Our evaluations of state-of-the-art LLM provers on this new…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01Rating 6Confidence 3

Strengths

Originality: The paper fills a major gap between contest-level and research-level mathematical benchmarks. The design of a progressive benchmark series is conceptually novel and important for longitudinal evaluation of reasoning systems. The detailed error taxonomy (Mathlib hallucination, Lean proficiency, misalignment, etc.) provides a structured lens for analyzing formalization failures. Quality: The benchmark construction is rigorous: problems collected, curated, and formalized through expert

Weaknesses

Lack of Actionable Insight from Core Finding: The central finding is that models fail at the translation/formalization stage. However, the paper offers little new, actionable mechanism or technique to address this challenge. It merely describes the phenomenon, classifies the symptoms (hallucinations, proficiency issues), and proposes a decoupled architecture as a future direction. Limited Scope for Deeper Behavioral Analysis (General vs. Prover): The argument for the general model's superior "e

Reviewer 02Rating 6Confidence 2

Strengths

The paper tackles an important gap: modern, abstract algebra at graduate and post-graduate levels, formalized in Lean. The graded design and the curation pipeline are carefully described, with inputs from expert mathematicians and Mathlib contributors, improving both difficulty control and content quality. The two-stage analysis is informative and supports a clear technical takeaway: current systems often find a plausible informal argument but fail to turn it into correct Lean code, mainly due t

Weaknesses

First, the claim that FATE-X is the first formal benchmark exceeding PhD qualifying exam difficulty and surpassing Mathlib coverage is important. Although the paper provides expert survey results and internal statistics, the case would be stronger with an external audit or a direct side-by-side with other recent research-level testbeds (e.g., RealMath or FrontierMath) in terms of formalization coverage and difficulty distribution. Second, the baseline suite is good, but the paper’s main scien

Reviewer 03Rating 8Confidence 4

Strengths

1. There are not enough formal math benchmarks, especially those targeting research-level math. Therefore, benchmarks like FATE constitute valuable contribution for the research community of AI for formal math. 2. The benchmark seems to be carefully designed and build, involving mathematicians and Lean experts -- this is important since it is relatively easy to introduce mistakes in the formal statements. 3. The analysis of the evaluation results are informative, interesting, and quite deep.

Weaknesses

1. The authors do not experiment with any more sophisticated inference strategies and jus use a fixed prompt and temperature to sample 64 times. It would be interesting to see, e.g., what performance can be achieved when the model receives Lean feedback for proof repair. 2. It would be good to see more examples of problems from the benchmark to get better feeling what is included. 3. It would be good to additionally see the performance of models from the Qwen 3 series. 4. The authors do not spec

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMathematics, Computing, and Information Processing · Logic, programming, and type systems · Topic Modeling