Reliable Evaluation and Benchmarks for Statement Autoformalization

Auguste Poiroux; Gail Weiss; Viktor Kun\v{c}ak; Antoine Bosselut

arXiv:2406.07222·cs.CL·October 30, 2025

Reliable Evaluation and Benchmarks for Statement Autoformalization

Auguste Poiroux, Gail Weiss, Viktor Kun\v{c}ak, Antoine Bosselut

PDF

Open Access 1 Repo 2 Datasets 2 Videos

TL;DR

This paper introduces improved metrics, datasets, and benchmarks for evaluating statement autoformalization, enabling more reliable assessment of systems translating natural language mathematics into formal languages like Lean 4.

Contribution

It presents BEq+ and ProofNetVerif for better evaluation, along with new benchmarks ProofNet# and RLM25, to systematically measure autoformalization progress.

Findings

01

Current techniques achieve up to 45.1% accuracy on undergraduate math.

02

Systems struggle with research-level content without proper context.

03

The work establishes a reliable foundation for future autoformalization evaluation.

Abstract

Evaluating statement autoformalization, translating natural language mathematics into formal languages like Lean 4, remains a significant challenge, with few metrics, datasets, and standards to robustly measure progress. In this work, we present a comprehensive approach combining improved metrics, robust benchmarks, and systematic evaluation, to fill this gap. First, we introduce BEq+, an automated metric that correlates strongly with human judgment, along with ProofNetVerif, a new dataset for assessing the quality of evaluation metrics, containing 3,752 annotated examples. Second, we develop two new autoformalization benchmarks: ProofNet#, a corrected version of ProofNet, and RLM25, with 619 new pairs of research-level mathematics from six formalization projects. Through systematic experimentation across these benchmarks, we find that current techniques can achieve up to 45.1% accuracy…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

augustepoiroux/LeanInteract
noneOfficial

Datasets

Videos

Reliable Evaluation and Benchmarks for Statement Autoformalization· underline

Taxonomy

TopicsMultimedia Communication and Technology

MethodsSparse Evolutionary Training · Balanced Selection