Enumerate-Conjecture-Prove: Formally Solving Answer-Construction Problems in Math Competitions

Jialiang Sun; Yuzhi Tang; Ao Li; Chris J. Maddison; Kuldeep S. Meel

arXiv:2505.18492·cs.AI·October 21, 2025

Enumerate-Conjecture-Prove: Formally Solving Answer-Construction Problems in Math Competitions

Jialiang Sun, Yuzhi Tang, Ao Li, Chris J. Maddison, Kuldeep S. Meel

PDF

Open Access 1 Repo 1 Datasets 4 Reviews

TL;DR

The paper introduces the ECP framework, a neuro-symbolic approach combining LLMs and formal theorem proving to solve answer-construction problems in math competitions, improving accuracy and rigor.

Contribution

It presents a novel modular neuro-symbolic method, ECP, integrating LLM-based enumeration and conjecturing with formal verification in Lean, along with a new dataset of answer-construction problems.

Findings

01

ECP formally solves 6 out of 337 Putnam answer problems, up from 4.

02

ECP achieves 33.1% accuracy on ConstructiveBench, surpassing previous methods.

03

ECP demonstrates the potential of combining LLM conjecturing with formal verification.

Abstract

Mathematical reasoning is central to artificial intelligence, with applications in education, code generation, and research-level mathematical discovery. Mathematical competitions highlight two problem types: theorem proving, requiring rigorous proofs, and answer construction, requiring creative generation and formal verification of mathematical objects. Existing research reveals that LLMs can tackle difficult answer-construction tasks but are prone to errors from hallucinations and unverifiable steps, while symbolic methods guarantee rigor but falter in creative answer construction. This raises a key understudied question: how to solve answer-construction problems while preserving both LLM creativity and mathematical rigor? To address this problem, we introduce the Enumerate-Conjecture-Prove (ECP) framework, a modular neuro-symbolic method integrating LLM-based enumeration and…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01Rating 2Confidence 4

Strengths

1. The effort for arriving at a correct autoformalization is commendable, involving multiple autoformalization models, retrieval of relevant documentation and Lean entitites, and a feedback loop. 2. The presented formalized dataset has significantly lower error rate than those presented in prior work. Further, including a human validation audit and an "after-cutoff" subset to address data contamination concerns adds significant credibility to the benchmark. 3. The `ConstructiveBench` dataset

Weaknesses

1. The primary weakness is that the empirical improvements from the complex `ECP` pipeline are marginal, especially for state-of-the-art models (e.g., a 0.6% absolute gain on `ConstructiveBench`). The paper provides no confidence intervals or statistical significance tests, making it difficult to determine if these small gains are meaningful or simply due to noise. 2. The different components of the pipeline are not separately evaluated. It is crucial that the authors perform ablation studies t

Reviewer 02Rating 4Confidence 3

Strengths

- The constructiveBench dataset could be useful to the community for advancing neuro-symbolic systems. - ECP improved the state of the art on ConstructiveBench, raising GPT-5 mini’s answer-construction accuracy from 69.7% to 73.6% and its end-to-end problem-solving accuracy from 32.5% to 33.1%.

Weaknesses

- It was unclear whether the focus of the paper is only on answer-construction problems or also on theorem-proving problems. Most of the discussions seem to be focusing only on answer-construction problem. Its better to clarify this. - Section 3.2 is bit cryptic and somewhat less clear. There is a scope of improving the writing to make it more accessible. - Figure 3 is not referred anywhere in the text. I, as a reader, found myself lost when reading Section 4.3 because it started in a bit abrupt

Reviewer 03Rating 2Confidence 4

Strengths

- Code is provided, well-documented, and well-structured. - Paper is clean and easy to read and understand. - The authors investigate an important area that is lacking in current formal reasoning, namely coming up with the answer. - The authors slightly outperform a reasonable baseline.

Weaknesses

My main concerns with this paper is the lack of any interesting contribution. I have structured each of my concerns in three categories: major weaknesses, weaknesses, and remarks. The latter have not influenced my judgment, but should be fixed. **Major Weaknesses** - **ConstructiveBench is very limited**, and just another formal benchmark with little effort taken in its creation (a simple, cheap pipeline is used that the authors themselves show contains significant noise). As such, it provides

Reviewer 04Rating 6Confidence 4

Strengths

- **Clear problem focus & formulation:** The paper crisply separates theorem-proving from answer-construction and formalizes the latter in Lean’s dependent type theory, including canonical-format constraints to prevent trivial “echo the statement” answers. - **New dataset with care for quality:** ConstructiveBench (3,640 problems) includes multiple sources, deduplication, a retrieval-aided autoformalization loop, and a post-cutoff test split (106 problems after June 2024) to mitigate contaminati

Weaknesses

- **Small end-to-end gains:** While answer-construction accuracy improves notably, end-to-end improvements are modest (e.g., 32.5% → 33.1% on ConstructiveBench; 4 → 6 on PutnamBench). It’s unclear whether these differences are statistically significant or practically meaningful given Pass@32 variance. - **Compute and sensitivity not fully surfaced:** The enumeration stage caps runtime (60 s) and candidates (≤100), but the paper does not quantify sensitivity to these budgets or how often enumerat

Code & Models

Repositories

jacksun200312/ecp
noneOfficial

Datasets

sunjia72/ConstructiveBench
dataset· 56 dl
56 dl

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsLogic, Reasoning, and Knowledge · Multi-Agent Systems and Negotiation · Natural Language Processing Techniques