LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches
Linyang He, Qiyao Yu, Hanze Dong, Baohao Liao, Xinxing Xu, Micah Goldblum, Jiang Bian, Nima Mesgarani

TL;DR
LiveMathematicianBench is a multiple-choice benchmark that evaluates large language models' research-level mathematical reasoning on theorems drawn from recent arXiv papers, with answer distractors produced by a proof-sketch-guided pipeline.
Contribution
The paper introduces a dynamic, contamination-resistant benchmark with a thirteen-category logical taxonomy of theorem types and proof-strategy-based distractors for evaluating research-level mathematical reasoning.
Findings
Gemini-3.1-pro-preview scores 43.5% accuracy.
GPT-5.4 scores 30.6% accuracy under substitution-resistant evaluation.
Proof-sketch access improves model reasoning performance.
Abstract
Mathematical reasoning is a hallmark of human intelligence, and whether large language models (LLMs) can meaningfully perform it remains a central question in artificial intelligence and cognitive science. As LLMs are increasingly integrated into scientific workflows, rigorous evaluation of their mathematical capabilities becomes a practical necessity. Existing benchmarks are limited by synthetic settings and data contamination. We present LiveMathematicianBench, a dynamic multiple-choice benchmark for research-level mathematical reasoning built from recent arXiv papers published after model training cutoffs. By grounding evaluation in newly published theorems, it provides a realistic testbed beyond memorized patterns. The benchmark introduces a thirteen-category logical taxonomy of theorem types (e.g., implication, equivalence, existence, uniqueness), enabling fine-grained evaluation…
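As a rough illustration of what fine-grained evaluation over a theorem-type taxonomy could look like, the sketch below computes multiple-choice accuracy separately for each category. It is not the authors' evaluation code: the record fields (`category`, `answer`, `predicted`), the function name `accuracy_by_category`, and the sample records are all assumptions made up for this example. Only the category names are taken from the abstract.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: fraction of correct answers} for multiple-choice records.

    Each record is a dict with hypothetical keys:
      "category"  - taxonomy label of the theorem (e.g. "implication")
      "answer"    - the correct option letter
      "predicted" - the option letter chosen by the model
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        totals[record["category"]] += 1
        if record["predicted"] == record["answer"]:
            correct[record["category"]] += 1
    # Every category in `totals` has at least one record, so no division by zero.
    return {category: correct[category] / totals[category] for category in totals}


# Made-up records for illustration only; these are not benchmark data.
records = [
    {"category": "implication", "answer": "A", "predicted": "A"},
    {"category": "implication", "answer": "C", "predicted": "B"},
    {"category": "existence", "answer": "D", "predicted": "D"},
    {"category": "uniqueness", "answer": "B", "predicted": "A"},
]

scores = accuracy_by_category(records)
for category, score in sorted(scores.items()):
    print(f"{category}: {score:.1%}")
```

Reporting one number per category, rather than a single aggregate, makes it visible when a model handles one theorem type (for example existence statements) well but struggles with another.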
