miniF2F-Lean Revisited: Reviewing Limitations and Charting a Path Forward
Azim Ospanov, Farzan Farnia, Roozbeh Yousefzadeh

TL;DR
This paper revisits the miniF2F benchmark, analyzing the limitations of current AI systems in formal mathematical reasoning, and introduces miniF2F-v2 with improved accuracy and verified statements to better evaluate progress in the field.
Contribution
The paper provides a detailed analysis of miniF2F's limitations, corrects discrepancies to create miniF2F-v2, and demonstrates significant accuracy improvements in theorem proving pipelines.
Findings
Best pipeline accuracy is 36% with current models.
Corrected discrepancies increased accuracy to 70%.
Majority of failures stem from formal-informal statement misalignments.
Abstract
We perform a thorough analysis of the formal and informal statements in the miniF2F benchmark from the perspective of an AI system that is tasked to participate in a math Olympiad consisting of the problems in miniF2F. In such setting, the model has to read and comprehend the problems in natural language, formalize them in Lean language, then proceed with proving the problems, and it will get credit for each problem if the formal proof corresponds to the original informal statement presented to the model. Our evaluation results reveal that the best accuracy of such pipeline can be about 36% using the SoTA models in the literature, considerably lower than the individual SoTA accuracies, 97% and 69% reported in the autoformalization and theorem proving literature. Analyzing the failure modes, we trace back a considerable portion of this drop to discrepancies between the formal and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsLogic, programming, and type systems · Natural Language Processing Techniques · Mathematics, Computing, and Information Processing
