Enumerate-Conjecture-Prove: Formally Solving Answer-Construction Problems in Math Competitions
Jialiang Sun, Yuzhi Tang, Ao Li, Chris J. Maddison, Kuldeep S. Meel

TL;DR
The paper introduces the ECP framework, a neuro-symbolic approach combining LLMs and formal theorem proving to solve answer-construction problems in math competitions, improving accuracy and rigor.
Contribution
It presents a novel modular neuro-symbolic method, ECP, integrating LLM-based enumeration and conjecturing with formal verification in Lean, along with a new dataset of answer-construction problems.
Findings
ECP formally solves 6 out of 337 Putnam answer problems, up from 4.
ECP achieves 33.1% accuracy on ConstructiveBench, surpassing previous methods.
ECP demonstrates the potential of combining LLM conjecturing with formal verification.
Abstract
Mathematical reasoning is central to artificial intelligence, with applications in education, code generation, and research-level mathematical discovery. Mathematical competitions highlight two problem types: theorem proving, requiring rigorous proofs, and answer construction, requiring creative generation and formal verification of mathematical objects. Existing research reveals that LLMs can tackle difficult answer-construction tasks but are prone to errors from hallucinations and unverifiable steps, while symbolic methods guarantee rigor but falter in creative answer construction. This raises a key understudied question: how to solve answer-construction problems while preserving both LLM creativity and mathematical rigor? To address this problem, we introduce the Enumerate-Conjecture-Prove (ECP) framework, a modular neuro-symbolic method integrating LLM-based enumeration and…
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
1. The effort for arriving at a correct autoformalization is commendable, involving multiple autoformalization models, retrieval of relevant documentation and Lean entitites, and a feedback loop. 2. The presented formalized dataset has significantly lower error rate than those presented in prior work. Further, including a human validation audit and an "after-cutoff" subset to address data contamination concerns adds significant credibility to the benchmark. 3. The `ConstructiveBench` dataset
1. The primary weakness is that the empirical improvements from the complex `ECP` pipeline are marginal, especially for state-of-the-art models (e.g., a 0.6% absolute gain on `ConstructiveBench`). The paper provides no confidence intervals or statistical significance tests, making it difficult to determine if these small gains are meaningful or simply due to noise. 2. The different components of the pipeline are not separately evaluated. It is crucial that the authors perform ablation studies t
- The constructiveBench dataset could be useful to the community for advancing neuro-symbolic systems. - ECP improved the state of the art on ConstructiveBench, raising GPT-5 mini’s answer-construction accuracy from 69.7% to 73.6% and its end-to-end problem-solving accuracy from 32.5% to 33.1%.
- It was unclear whether the focus of the paper is only on answer-construction problems or also on theorem-proving problems. Most of the discussions seem to be focusing only on answer-construction problem. Its better to clarify this. - Section 3.2 is bit cryptic and somewhat less clear. There is a scope of improving the writing to make it more accessible. - Figure 3 is not referred anywhere in the text. I, as a reader, found myself lost when reading Section 4.3 because it started in a bit abrupt
- Code is provided, well-documented, and well-structured. - Paper is clean and easy to read and understand. - The authors investigate an important area that is lacking in current formal reasoning, namely coming up with the answer. - The authors slightly outperform a reasonable baseline.
My main concerns with this paper is the lack of any interesting contribution. I have structured each of my concerns in three categories: major weaknesses, weaknesses, and remarks. The latter have not influenced my judgment, but should be fixed. **Major Weaknesses** - **ConstructiveBench is very limited**, and just another formal benchmark with little effort taken in its creation (a simple, cheap pipeline is used that the authors themselves show contains significant noise). As such, it provides
- **Clear problem focus & formulation:** The paper crisply separates theorem-proving from answer-construction and formalizes the latter in Lean’s dependent type theory, including canonical-format constraints to prevent trivial “echo the statement” answers. - **New dataset with care for quality:** ConstructiveBench (3,640 problems) includes multiple sources, deduplication, a retrieval-aided autoformalization loop, and a post-cutoff test split (106 problems after June 2024) to mitigate contaminati
- **Small end-to-end gains:** While answer-construction accuracy improves notably, end-to-end improvements are modest (e.g., 32.5% → 33.1% on ConstructiveBench; 4 → 6 on PutnamBench). It’s unclear whether these differences are statistically significant or practically meaningful given Pass@32 variance. - **Compute and sensitivity not fully surfaced:** The enumeration stage caps runtime (60 s) and candidates (≤100), but the paper does not quantify sensitivity to these budgets or how often enumerat
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsLogic, Reasoning, and Knowledge · Multi-Agent Systems and Negotiation · Natural Language Processing Techniques
