Proof2Hybrid: Automatic Mathematical Benchmark Synthesis for Proof-Centric Problems
Yebo Peng, Zixiang Liu, Yaoming Li, Zhizhuo Yang, Xinye Xu, Bowen Ye, Weijun Yuan, Zihan Wang, Tong Yang

TL;DR
Proof2Hybrid is an automated framework that synthesizes proof-centric mathematical benchmarks from natural language corpora, enabling more accurate evaluation of LLMs' mathematical reasoning abilities, demonstrated through the AlgGeoTest benchmark.
Contribution
The paper introduces Proof2X, a novel method for converting proofs into verifiable questions, and creates AlgGeoTest, a new algebraic geometry benchmark for assessing LLMs.
Findings
LLMs show significant gaps in understanding algebraic geometry
The hybrid question format improves robustness of evaluation
Automated benchmark synthesis scales evaluation of mathematical reasoning
Abstract
Evaluating the mathematical capability of Large Language Models (LLMs) is a critical yet challenging frontier. Existing benchmarks fall short, particularly for proof-centric problems, as manual creation is unscalable and costly, leaving the true mathematical abilities of LLMs largely unassessed. To overcome these barriers, we propose Proof2Hybrid, the first fully automated framework that synthesizes high-quality, proof-centric benchmarks from natural language mathematical corpora. The key novelty of our solution is Proof2X, a roadmap of converting mathematical proofs into various kinds of questions that are easy to verify. Instructed by this roadmap, we propose a new type of hybrid-formatted questions, named ``$m$-out-of-$n$ multiple judge questions'', specifically designed to enable robust, automatic evaluation while being resilient to guessing and superficial pattern matching inherent…
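The resilience-to-guessing claim can be made concrete with a small calculation. The sketch below is illustrative only and rests on assumptions not stated in this summary: the respondent knows m, must return exactly the set of m true items, and is scored all-or-nothing. The function names and the values n = 6, m = 2 are hypothetical and do not come from the paper.

```python
from itertools import combinations
from math import comb


def chance_accuracy(n: int, m: int) -> float:
    """Probability that a uniformly random size-m subset of n items
    equals the set of m true items (assumed all-or-nothing scoring)."""
    return 1 / comb(n, m)


def brute_force_chance(n: int, m: int) -> float:
    """Same quantity, obtained by enumerating every size-m subset."""
    truth = set(range(m))  # hypothetical labelling: items 0..m-1 are the true ones
    subsets = list(combinations(range(n), m))
    hits = sum(set(s) == truth for s in subsets)
    return hits / len(subsets)


if __name__ == "__main__":
    # A 4-option single-answer multiple-choice question, for comparison.
    print("1-out-of-4:", chance_accuracy(4, 1))
    # An illustrative 2-out-of-6 question (values chosen for the example).
    print("2-out-of-6:", chance_accuracy(6, 2))
    print("2-out-of-6 by enumeration:", brute_force_chance(6, 2))
```

Under these assumptions a 1-out-of-4 question can be guessed with probability 1/4, while a 2-out-of-6 question can be guessed with probability 1/15. A benchmark that scores items differently would need a different calculation.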
Peer Reviews
Decision · Submitted to ICLR 2026
- The paper offers interesting insights into the synthetic framework for mathematical competence verification benchmarks and introduces AlgGeoTest, a noteworthy algebraic geometry benchmark featuring hybrid-format problems.
- It provides a comprehensive overview of relevant research on mathematical benchmarks.
- Writing clarity: The paper lacks clear presentation, making it difficult to identify key insights.
- Insufficient data examples: Figure 1 and the question format comparison in Figure 3 are not adequately described or supported with clear examples, weakening the persuasiveness of the paper's core contributions.
- Limited benchmark comparisons: The paper does not provide sufficient comparisons with other mathematical benchmarks (e.g., those listed in Table 1). An analysis of performance variatio
The model-agnostic question generation pipeline has a lot of potential.
- I have some misgivings about an entirely LLM-assisted pipeline. This may propagate LLM biases in unexpected ways.
- There is only a single figure with results, and it is hard to read and to draw take-home information from.
- A concrete answer to a real gap. Prior math benchmarks skew to numeric answers; proof-centric evaluation at scale is missing. The paper directly targets this gap with an automatic pipeline over a natural-language corpus rather than formal systems only.
- Format innovation with clear rationale. The m-out-of-n format is well-motivated: it reduces chance accuracy, blocks option-comparison shortcuts, and reframes evaluation as relative correctness ranking, which can reduce sensitivity to each mod
A. Single-domain instantiation. The method is positioned as domain-agnostic, but the paper only shows algebraic geometry. To support generality, at least one additional area (e.g., commutative algebra or topology) would strengthen the claim.
B. Style and memorization confounds. The true items are original seeds from the Stacks Project, while false items are model-generated edits. Well-trained models may recognize the “house style” of Stacks and prefer those options. A control where true items
Taxonomy
Topics: Mathematics, Computing, and Information Processing · Machine Learning in Materials Science · Polynomial and algebraic computation
