IndiMathBench: Autoformalizing Mathematical Reasoning Problems with a Human Touch
Param Biyani, Shashank Kirtania, Yasharth Bajpai, Sumit Gulwani, Ashish Tiwari

TL;DR
IndiMathBench is a human-verified benchmark for evaluating mathematical theorem proving, created using an AI-assisted pipeline to formalize natural language problems into Lean, highlighting the challenges of autoformalization and theorem proving.
Contribution
The paper introduces IndiMathBench, a novel human-verified benchmark for mathematical autoformalization using an AI-assisted pipeline, addressing data scarcity and validation challenges.
Findings
Autoformalization remains challenging with significant gaps in semantic correctness.
The success rate of theorem proving on the benchmark is low, even with iterative refinement.
The benchmark provides a rigorous testbed for advancing mathematical reasoning models.
Abstract
Reliable autoformalization remains challenging even in the era of large language models (LLMs). The scarcity of high-quality training data is a major bottleneck. Expert annotation requires substantial time and deep expertise in both mathematics and theorem proving. We introduce IndiMathBench, a human-verified benchmark designed to evaluate mathematical theorem proving, curated using an AI-powered human-assisted pipeline for formalizing natural language problems in Lean. IndiMathBench is composed of 312 formal Lean 4 theorems paired with their corresponding informal problem statements, sourced from Indian Mathematics Olympiads. Through category-based retrieval, iterative compiler feedback, and multi-model ensembles, our pipeline generates candidate formalizations that experts efficiently validate via an interactive dashboard with automated quality summaries. Evaluation across multiple…
Peer Reviews
Decision·ICLR 2026 Conference Desk Rejected Submission
1. Timely and relevant resource: the paper offers a curated Lean4 benchmark that fills a data gap for formal mathematics research, especially given recent interest in LLM-based theorem proving. 2. Human-in-the-loop pipeline: it gets human involved in multi-LLM synthesis, Lean validation, human repair cycle, and is thoughtfully designed and practically useful for future dataset building. 3. Open release commitment: Providing both dataset and dashboard promotes reproducibility and community adop
1. Weak experimental insights: evaluation simply shows that current LLMs fail badly, but there's a shortage of deeper analysis (e.g., failure types, linguistic vs. logical errors, success conditions). 2. Need more justification in reliability and feasibility of the benchmark: reporting that all models fail is not very informative without breakdowns or ablations. Besides, more theoretical proof or imperical evidence (at least human insight or so?) will be beneficial to make the benchmark more at
* **Valuable Dataset Contribution:** Creating a high-quality, expert-verified dataset of formalized mathematics is, in itself, a beneficial contribution to the community. The release of IndiMathBench provides researchers with a new testbed that is potentially "uncontaminated" by existing models. * **Detailed Annotation Efficiency Study:** The paper presents a time-cost comparison of different annotation workflows (manual-only, multi-model assisted, full system) in Figure 4. This provides qua
1. **Lack of Novelty in the Core Contribution (Benchmark Creation):** The paper's main contribution is a new autoformalization benchmark. However, in the domain of competition mathematics, similar formalization benchmarks (e.g., MiniF2F, Putnam-Bench) are already abundant and continue to grow. The methodology used to create IndiMathBench—formalizing math competition problems—is essentially a straightforward migration of existing workflows and lacks methodological innovation. The paper claims th
1. IndiMathBench covers problem types such as set theory, combinatorics, and geometry that are not typically included in existing ATP benchmark datasets (like miniF2F). This diversity enhances the practical value of the dataset as a comprehensive ATP evaluation resource. 2. The proposed VS Code dashboard holds considerable potential for accelerating data annotation in this field. As shown in Figure 4 of the paper, the pipeline seems to significantly reduce the time required to annotate formal m
1. The proposed autoformalization pipeline is not truly novel; rather, it largely applies existing techniques to the Indian Mathematics Olympics dataset with some engineering optimizations. Specifically, the idea of a self-debug loop using error messages from the Lean compiler has appeared in prior works [1]; the concept of human feedback-assisted adjustment has been explored in earlier work [2]; and the use of retrieval from mathlib to enhance autoformalization capability is also present in pre
1. The formalization of the problems presented in the paper is of good quality, especially the clever formalization of the three integer roots in the problem inmo_2017_2 in appendix A. 2. The pipeline of this paper has indeed improved the annotation efficiency by 2-3 times that of human experts, which is a certain improvement. 3. It is original and has engineering value. It transforms an AI-assisted formalization framework into a user-friendly VSCode extension, which may be useful for the develo
The two main contributions of the paper, the **formal benchmark** itself and the **autoformalization pipeline**, both have certain weaknesses. Regarding the formal benchmark itself, 1. The benchmark is not substantially different from the existing IMO benchmark, and its coverage remains almost identical. Other formal benchmarks at the high school competition level, such as miniF2F [1] and Fimo [2], already exist. 2. Concerning the issue of data contamination with MiniF2F and IMO, I believe tha
Videos
Taxonomy
TopicsMathematics, Computing, and Information Processing · Mathematics Education and Teaching Techniques · Teaching and Learning Programming
