Proof2Hybrid: Automatic Mathematical Benchmark Synthesis for Proof-Centric Problems

Yebo Peng; Zixiang Liu; Yaoming Li; Zhizhuo Yang; Xinye Xu; Bowen Ye; Weijun Yuan; Zihan Wang; Tong Yang

arXiv:2508.02208·cs.CL·August 6, 2025

Open Access · 1 Dataset · 3 Reviews

TL;DR

Proof2Hybrid is an automated framework that synthesizes proof-centric mathematical benchmarks from natural language corpora, enabling more accurate evaluation of LLMs' mathematical reasoning abilities, demonstrated through the AlgGeoTest benchmark.

Contribution

The paper introduces Proof2X, a novel method for converting proofs into verifiable questions, and creates AlgGeoTest, a new algebraic geometry benchmark for assessing LLMs.

Findings

01 · LLMs show significant gaps in understanding algebraic geometry

02 · The hybrid question format improves robustness of evaluation

03 · Automated benchmark synthesis scales evaluation of mathematical reasoning

Abstract

Evaluating the mathematical capability of Large Language Models (LLMs) is a critical yet challenging frontier. Existing benchmarks fall short, particularly for proof-centric problems, as manual creation is unscalable and costly, leaving the true mathematical abilities of LLMs largely unassessed. To overcome these barriers, we propose Proof2Hybrid, the first fully automated framework that synthesizes high-quality, proof-centric benchmarks from natural language mathematical corpora. The key novelty of our solution is Proof2X, a roadmap for converting mathematical proofs into various kinds of questions that are easy to verify. Instructed by this roadmap, we propose a new type of hybrid-formatted questions, named "$m$-out-of-$n$ multiple judge questions", specifically designed to enable robust, automatic evaluation while being resilient to guessing and superficial pattern matching inherent…
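The abstract says the $m$-out-of-$n$ format is resilient to guessing, but the excerpt cuts off before the scoring rule is given. The sketch below is only an illustration under stated assumptions: each question shows n statements, exactly m of them are true, and a response earns credit only if it selects exactly the true set. The function names are hypothetical and are not taken from the paper or its code.

```python
from math import comb


def chance_accuracy_known_m(n: int, m: int) -> float:
    """Chance of a uniformly random m-subset matching the true set.

    Assumption: the solver knows m and picks m of the n items at random.
    """
    if not 0 <= m <= n:
        raise ValueError("m must satisfy 0 <= m <= n")
    return 1 / comb(n, m)


def chance_accuracy_independent(n: int) -> float:
    """Chance that n independent fair coin-flip judgments are all correct.

    Assumption: the solver does not know m and judges each item separately.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    return 0.5 ** n


def score_exact_match(predicted: set, truth: set) -> int:
    """Return 1 only when the selected items equal the true items exactly."""
    return int(predicted == truth)


if __name__ == "__main__":
    # Compare with a 4-option single-answer multiple-choice question (0.25).
    print(chance_accuracy_known_m(6, 2))
    print(chance_accuracy_independent(6))
    print(score_exact_match({0, 3}, {3, 0}))
```

Under these assumptions, a question with n = 6 and m = 2 has a random-guess success rate of 1/15, well below the 1/4 of a standard four-option multiple-choice item. The values of n and m here are arbitrary examples, not the settings used in AlgGeoTest.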

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- The paper offers interesting insights into the synthetic framework for mathematical competence verification benchmarks and introduces AlgGeoTest, a noteworthy algebraic geometry benchmark featuring hybrid-format problems.
- It provides a comprehensive overview of relevant research on mathematical benchmarks.

Weaknesses

- Writing clarity: The paper lacks clear presentation, making it difficult to identify key insights.
- Insufficient data examples: Figure 1 and the question format comparison in Figure 3 are not adequately described or supported with clear examples, weakening the persuasiveness of the paper's core contributions.
- Limited benchmark comparisons: The paper does not provide sufficient comparisons with other mathematical benchmarks (e.g., those listed in Table 1). An analysis of performance variatio

Reviewer 02 · Rating 2 · Confidence 4

Strengths

The model-agnostic question generation pipeline has a lot of potential.

Weaknesses

- I have some misgivings about an entirely LLM-assisted pipeline. This may propagate LLM biases in unexpected ways.
- There is a single figure with results. It is hard to read and hard to draw take-home information from.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- A concrete answer to a real gap. Prior math benchmarks skew to numeric answers; proof-centric evaluation at scale is missing. The paper directly targets this gap with an automatic pipeline over a natural-language corpus rather than formal systems only.
- Format innovation with clear rationale. The m-out-of-n format is well-motivated: it reduces chance accuracy, blocks option-comparison shortcuts, and reframes evaluation as relative correctness ranking, which can reduce sensitivity to each mod

Weaknesses

A. Single-domain instantiation. The method is positioned as domain-agnostic, but the paper only shows algebraic geometry. To support generality, at least one additional area (e.g., commutative algebra or topology) would strengthen the claim.
B. Style and memorization confounds. The true items are original seeds from the Stacks Project, while false items are model-generated edits. Well-trained models may recognize the “house style” of Stacks and prefer those options. A control where true items

Code & Models

Datasets

PKU-DS-LAB/AlgGeoTest
dataset· 27 dl

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Machine Learning in Materials Science · Polynomial and algebraic computation