LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches

Linyang He; Qiyao Yu; Hanze Dong; Baohao Liao; Xinxing Xu; Micah Goldblum; Jiang Bian; Nima Mesgarani

arXiv:2604.01754 · cs.CL · April 3, 2026

TL;DR

LiveMathematicianBench is a realistic benchmark for evaluating the mathematical reasoning of large language models, built from recent arXiv papers with a proof-sketch-guided distractor pipeline.

Contribution

It introduces a dynamic, contamination-resistant benchmark for research-level mathematical reasoning, with a detailed logical taxonomy of theorem types and distractors derived from proof strategies.

Findings

01. Gemini-3.1-pro-preview scores 43.5% accuracy.

02. GPT-5.4 scores 30.6% accuracy under substitution-resistant evaluation.

03. Proof-sketch access improves model reasoning performance.

Abstract

Mathematical reasoning is a hallmark of human intelligence, and whether large language models (LLMs) can meaningfully perform it remains a central question in artificial intelligence and cognitive science. As LLMs are increasingly integrated into scientific workflows, rigorous evaluation of their mathematical capabilities becomes a practical necessity. Existing benchmarks are limited by synthetic settings and data contamination. We present LiveMathematicianBench, a dynamic multiple-choice benchmark for research-level mathematical reasoning built from recent arXiv papers published after model training cutoffs. By grounding evaluation in newly published theorems, it provides a realistic testbed beyond memorized patterns. The benchmark introduces a thirteen-category logical taxonomy of theorem types (e.g., implication, equivalence, existence, uniqueness), enabling fine-grained evaluation…
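
As a rough illustration of the four theorem types the abstract names, the schematic logical forms below may help. The symbols P, Q, and X are placeholders chosen here, not notation from the paper, and the remaining nine categories are not listed in the text above, so they are not guessed at.

```latex
\begin{align*}
\text{Implication:} &\quad P \implies Q \\
\text{Equivalence:} &\quad P \iff Q \\
\text{Existence:}   &\quad \exists\, x \in X,\; P(x) \\
\text{Uniqueness:}  &\quad \forall x, y \in X,\; \bigl(P(x) \land P(y)\bigr) \implies x = y
\end{align*}
```

Under this reading, a uniqueness statement says only that at most one object satisfies P; combining it with the existence form gives the familiar "exists a unique" claim.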
