Mathesis: Towards Formal Theorem Proving from Natural Languages
Yu Xuejun, Jianyuan Zhong, Zijin Feng, Pengyi Zhai, Roozbeh Yousefzadeh, Wei Chong Ng, Haoxiong Liu, Ziyi Shou, Jing Xiong, Yudong Zhou, Claudia Beth Ong, Austen Jeremy Sugiarto, Yaoxi Zhang, Wai Ming Tai, Huan Cao, Dongcai Lu, Jiacheng Sun, Qiang Xu, Shen Xin, Zhenguo Li

TL;DR
Mathesis presents an end-to-end pipeline that automates formal theorem proving from natural language problem statements, combining autoformalization and proof generation, and demonstrates strong results on real-world exam problems.
Contribution
It introduces the first autoformalizer using reinforcement learning and a comprehensive theorem proving pipeline capable of handling informal natural language inputs.
Findings
Autoformalizer outperforms baselines by 22% in pass-rate on Gaokao-Formal.
Achieves 64% accuracy on MiniF2F with pass@32.
Attains 18% state-of-the-art accuracy on Gaokao-Formal.
Abstract
Recent advances in large language models show strong promise for formal reasoning. However, most LLM-based theorem provers have long been constrained by the need for expert-written formal statements as inputs, limiting their applicability to real-world problems expressed in natural language. We tackle this gap with Mathesis, the first end-to-end theorem proving pipeline processing informal problem statements. It contributes Mathesis-Autoformalizer, the first autoformalizer using reinforcement learning to enhance the formalization ability of natural language problems, aided by our novel LeanScorer framework for nuanced formalization quality assessment. It also proposes a Mathesis-Prover, which generates formal proofs from the formalized statements. To evaluate the real-world applicability of end-to-end formal theorem proving, we introduce Gaokao-Formal, a benchmark of 488 complex…
Peer Reviews
Decision·ICLR 2026 Poster
* Evaluating the semantic correctness of autoformalized statements is an important bottleneck in autoformalization. The paper proposed LeanScorer that aims to address this problem * This paper is one of the first to train autoformalization models with RL, which is nontrivial because of the difficulty in semantic evaluation.
* The task of "formal theorem proving from natural language" in this paper is not convincing. In this paper, it is defined as: Given a theorem statement in natural language, the model generates its formal statement and the corresponding formal proof. It is evaluated using proof accuracy (Fig. 4). This task definition does not evaluate whether the formalized statement is semantically correct. The model could formalize the natural language statement into a easier formal statement so that the proof
The combination of Reinforcement Learning with Hierarchical Preference Optimization (HPO) for the autoformalization task is the highlight of this paper. This approach enables dynamic learning from the feedback of a syntax checker, a semantic evaluator, and a downstream prover, significantly improving the quality of the formalized statements. The experimental results (e.g., a 45% relative improvement on Gaokao-Formal) fully demonstrate the effectiveness of this method. The authors not only teste
The first is about the dependence of LeanScorer on an LLM Evaluator. The core steps of LeanScorer (sub-task decomposition and consistency annotation) rely on an LLM. While experiments show high agreement with human annotators, this still introduces a potential source of noise, which to some extent undermines the credibility of the evaluation. The second is that the paper uses downstream proof success as a metric may introduce bias. This could potentially lead the model to favor generating weak
1. Formal theorem proving is a highly active research domain, and the motivation for the proposed system is clear and compelling. 2. The proposed methods are conceptually simple yet empirically effective. 3. The introduction of Gaokao-Formal is valuable in this field.
1. The paper lacks an adequate quality assessment of the Gaokao-Formal dataset. Some details are missing, such as annotator expertise, inter-annotator agreement statistics, and the protocol for resolving disagreements. Moreover, the subset used in Section 4.1 is not described, making it difficult to interpret the reported F1 scores. 2. LeanScorer introduces human priors and relies on additional compute (subtask decomposition → evaluation → aggregation) to produce higher scores. As such, it shoul
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsLogic, programming, and type systems · Topic Modeling · Mathematics, Computing, and Information Processing
