FMC: Formalization of Natural Language Mathematical Competition Problems
Jiaxuan Xie, Chengwu Liu, Ye Yuan, Siqi Li, Zhiping Xiao, Ming Zhang

TL;DR
This paper introduces an autoformalization pipeline using large language models to convert natural language mathematical problems into formal representations, creating a new benchmark dataset for automated theorem proving.
Contribution
It presents a fully automatic, training-free autoformalization method, curates a large Olympiad-level dataset, and evaluates LLMs and theorem provers on this challenging benchmark.
Findings
Few-shot learning improves formalization accuracy.
Error feedback enhances autoformalization performance.
The dataset is challenging for current theorem provers.
Abstract
Efficient and accurate autoformalization methods, which leverage large-scale datasets of extensive natural language mathematical problems to construct formal language datasets, are key to advancing formal mathematical reasoning. In this paper, we propose an autoformalization pipeline based on large language models with error feedback, achieving a fully automatic and training-free formalization approach. Using this pipeline, we curate an Olympiad-level dataset aligning natural language problems with Lean formalizations. The dataset comprises mathematical problems in natural language and in Lean, of which were assessed as at least above-average quality, making it suitable as a benchmark for automated theorem provers. Additionally, we investigate the formalization and reasoning capabilities of various LLMs and empirically demonstrate that few-shot learning, error…
Peer Reviews
Decision·Submitted to ICLR 2026
- The proposed combination of error feedback and problem decomposition provides a well-motivated and empirically validated improvement over prior autoformalization frameworks. The training-free design is elegant and practical. - The dataset focuses on Olympiad-level problems, ensuring both semantic richness and difficulty. The reported statistics (93% syntactic, 67% semantic accuracy) are impressive given the full automation and task complexity. - The paper conducts detailed comparisons across m
- While the combination of known components (few-shot prompting, error feedback, decomposition) works well, the conceptual novelty is somewhat incremental compared to existing autoformalization frameworks such as StepFun-Formalizer or KELPS. - The paper includes some case studies, but a deeper quantitative or typological analysis of the 33% semantic failures would strengthen the understanding of model limitations and inform future improvements. - The paper does not specify whether the dataset, c
- The overall idea of the paper is well-motivated and technically sound. The combination of few-shot learning and error feedback effectively enriches contextual information, which can improve the robustness and accuracy of the autoformalization process. Moreover, problem decomposition bridges the gap between natural language expressions and their formal counterparts in Lean, facilitating more reliable semantic alignment and verification. - The paper conducts a thorough and systematic ablation st
- The paper contains several presentation inconsistencies that detract from readability and professionalism. For instance, Table 4 is never cited in the main text, and the captions for some figures and tables are overly minimal. Captions such as that of Figure 1 should include a more detailed explanation of the depicted pipeline. Additionally, there is a numerical inconsistency in Section 4.3: the values 3,922 and 87.14% appear only once, whereas 3,214 and 66.99% are used elsewhere. This discrep
* The paper is easy to follow, with a clear motivation. * The proposed dataset could be valuable for training LLMs in both autoformalization and theorem proving tasks.
The proposed method lacks novelty. Techniques such as few-shot prompting, syntactic correction through error feedback, and semantic verification have already been explored in prior work [1, 2, 3, 4]. Moreover, the autoformalized statements produced by the system are only guaranteed to be syntactically valid, with semantic consistency checked solely by LLMs. There is no manual verification to ensure that the formal statements truly align with their original natural-language counterparts. In many
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMathematics, Computing, and Information Processing · Natural Language Processing Techniques
