FATE: A Formal Benchmark Series for Frontier Algebra of Multiple Difficulty Levels
Jiedong Jiang, Wanyi He, Yuefeng Wang, Guoxiong Gao, Yongle Hu, Jingting Wang, Nailin Guan, Peihao Wu, Chunbo Dai, Liang Xiao, Bin Dong

TL;DR
FATE introduces a series of formal algebra benchmarks with varying difficulty levels to evaluate and advance the capabilities of language models in complex mathematical reasoning and formalization.
Contribution
The paper presents FATE, a novel benchmark series in formal algebra with problems surpassing PhD exam difficulty, and evaluates state-of-the-art models revealing significant performance gaps.
Findings
Models achieve only 3% accuracy on FATE-H and 0% on FATE-X.
Natural-language reasoning is more accurate than formalization.
Specialized provers may perform worse than general-purpose models.
Abstract
Recent advances in large language models (LLMs) have demonstrated impressive capabilities in formal theorem proving, particularly on contest-based mathematical benchmarks like the IMO. However, these contests do not reflect the depth, breadth, and abstraction of modern mathematical research. To bridge this gap, we introduce FATE (Formal Algebra Theorem Evaluation), a new benchmark series in formal algebra designed to chart a course toward advanced mathematical reasoning. We present two new components, FATE-H and FATE-X, each with 100 problems in abstract and commutative algebra. The FATE series spans a difficulty spectrum from undergraduate exercises to problems exceeding PhD qualifying exams. Notably, FATE-X is the first formal benchmark to surpass both PhD-level exam difficulty and the coverage of the Mathlib library. Our evaluations of state-of-the-art LLM provers on this new…
Peer Reviews
Decision·ICLR 2026 Poster
Originality: The paper fills a major gap between contest-level and research-level mathematical benchmarks. The design of a progressive benchmark series is conceptually novel and important for longitudinal evaluation of reasoning systems. The detailed error taxonomy (Mathlib hallucination, Lean proficiency, misalignment, etc.) provides a structured lens for analyzing formalization failures. Quality: The benchmark construction is rigorous: problems collected, curated, and formalized through expert
Lack of Actionable Insight from Core Finding: The central finding is that models fail at the translation/formalization stage. However, the paper offers little new, actionable mechanism or technique to address this challenge. It merely describes the phenomenon, classifies the symptoms (hallucinations, proficiency issues), and proposes a decoupled architecture as a future direction. Limited Scope for Deeper Behavioral Analysis (General vs. Prover): The argument for the general model's superior "e
The paper tackles an important gap: modern, abstract algebra at graduate and post-graduate levels, formalized in Lean. The graded design and the curation pipeline are carefully described, with inputs from expert mathematicians and Mathlib contributors, improving both difficulty control and content quality. The two-stage analysis is informative and supports a clear technical takeaway: current systems often find a plausible informal argument but fail to turn it into correct Lean code, mainly due t
First, the claim that FATE-X is the first formal benchmark exceeding PhD qualifying exam difficulty and surpassing Mathlib coverage is important. Although the paper provides expert survey results and internal statistics, the case would be stronger with an external audit or a direct side-by-side with other recent research-level testbeds (e.g., RealMath or FrontierMath) in terms of formalization coverage and difficulty distribution. Second, the baseline suite is good, but the paper’s main scien
1. There are not enough formal math benchmarks, especially those targeting research-level math. Therefore, benchmarks like FATE constitute valuable contribution for the research community of AI for formal math. 2. The benchmark seems to be carefully designed and build, involving mathematicians and Lean experts -- this is important since it is relatively easy to introduce mistakes in the formal statements. 3. The analysis of the evaluation results are informative, interesting, and quite deep.
1. The authors do not experiment with any more sophisticated inference strategies and jus use a fixed prompt and temperature to sample 64 times. It would be interesting to see, e.g., what performance can be achieved when the model receives Lean feedback for proof repair. 2. It would be good to see more examples of problems from the benchmark to get better feeling what is included. 3. It would be good to additionally see the performance of models from the Qwen 3 series. 4. The authors do not spec
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMathematics, Computing, and Information Processing · Logic, programming, and type systems · Topic Modeling
