QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs

Santiago Gonzalez; Alireza Amiri Bavandpour; Peter Ye; Edward Zhang; Ruslans Aleksejevs; Todor Antić; Polina Baron; Sujeet Bhalerao; Shubhrajit Bhattacharya; Zachary Burton; John Byrne; Hyungjun Choi; Nujhat Ahmed Disha; Koppany István Encz; Yuchen Fang; Robert Joseph George; Ebrahim Ghorbani; Alan Goldfarb; Jing Guo; Meghal Gupta; Stefano Huber; Annika Kanckos; Minjung Kang; Hyun Jong Kim; Dino Lorenzini; Levi Lorenzo; Tianyi Mao; Giovanni Marzenta; Ariane M. Masuda; Lukas Mauth; Ana Mickovic; Andres Miniguano-Trujillo; Antoine Moulin; Wenqi Ni; Tomos Parry; Kevin Ren; Hossein Roodbarani; Mathieu Rundström; Manjil Saikia; Detchat Samart; Rebecca Steiner; Connor Stewart; Dhara Thakkar; Jeffrey Tse; Vasiliki Velona; Yunhai Xiang; Sibel Yalçın; Jun Yan; Ji Zeng; Arman Cohan; Quanquan C. Liu

arXiv:2602.20629 · cs.LG · March 3, 2026

TL;DR

This paper introduces QEDBench, a large-scale benchmark to measure how well automated evaluators align with human experts in assessing university-level mathematical proofs, revealing biases and reasoning gaps in current models.

Contribution

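The paper presents QEDBench, described by the authors as the first large-scale dual-rubric alignment benchmark for judging university-level math proofs. It contrasts course-specific rubrics with expert common-knowledge criteria and highlights systematic biases and reasoning limitations in current AI judges.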

Findings

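01. Certain frontier evaluators (Claude Opus 4.5, DeepSeek-V3, Qwen 2.5 Max, and Llama 4 Maverick) exhibit significant positive bias, with mean score inflation of up to +0.18, +0.20, +0.30, and +0.36.

02. State-of-the-art models perform well in continuous domains but poorly in discrete math.

03. QEDBench is publicly available for further research.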

Abstract

As Large Language Models (LLMs) saturate elementary benchmarks, the research frontier has shifted from generation to the reliability of automated evaluation. We demonstrate that standard "LLM-as-a-Judge" protocols suffer from a systematic Alignment Gap when applied to upper-undergraduate to early graduate level mathematics. To quantify this, we introduce QEDBench, the first large-scale dual-rubric alignment benchmark to systematically measure alignment with human experts on university-level math proofs by contrasting course-specific rubrics against expert common-knowledge criteria. By deploying a dual-evaluation matrix (7 judges × 5 solvers) against 1,000+ hours of human evaluation, we reveal that certain frontier evaluators like Claude Opus 4.5, DeepSeek-V3, Qwen 2.5 Max, and Llama 4 Maverick exhibit significant positive bias (up to +0.18, +0.20, +0.30, +0.36 mean score inflation,…
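To make the "mean score inflation" figures concrete, here is a minimal sketch of one way such a per-judge bias could be computed. It assumes bias means the average of (judge score minus human expert score) over the same graded proofs. The function name, record layout, judge labels, and scores are all hypothetical illustrations, not the paper's data or its exact metric.

```python
from collections import defaultdict
from statistics import mean


def mean_score_inflation(records):
    """Return {judge: mean(judge_score - human_score)} over all graded proofs.

    `records` is an iterable of (judge, judge_score, human_score) tuples.
    A positive value means the judge scores higher than the human expert
    on average; a negative value means it scores lower.
    """
    diffs = defaultdict(list)
    for judge, judge_score, human_score in records:
        diffs[judge].append(judge_score - human_score)
    return {judge: mean(d) for judge, d in diffs.items()}


# Hypothetical scores on an assumed 0-1 scale (not taken from the paper).
records = [
    ("judge_a", 1.0, 0.5),
    ("judge_a", 0.5, 0.5),
    ("judge_b", 0.0, 0.5),
    ("judge_b", 1.0, 1.0),
]

bias = mean_score_inflation(records)
for judge, value in sorted(bias.items()):
    print(f"{judge}: {value:+.2f}")
```

In a judges-by-solvers setup like the one the abstract mentions, the same function could be applied per (judge, solver) pair by using that pair as the grouping key instead of the judge alone.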

Code & Models

Datasets

qqggez/QEDBench
dataset · 13 dl

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Machine Learning in Materials Science · Advanced Graph Neural Networks