Evaluating Frontier LLMs on PhD-Level Mathematical Reasoning: A Benchmark on a Textbook in Theoretical Computer Science about Randomized Algorithms

Yang Cao; Yubin Chen; Xuyang Guo; Zhao Song; Song Yue; Jiahao Zhang; Jiale Zhao

arXiv:2512.13978·cs.AI·December 17, 2025

Open Access

TL;DR

This paper benchmarks four advanced large language models on graduate-level mathematical reasoning tasks from a textbook on randomized algorithms, assessing their accuracy, logical coherence, and reliability in formal proof generation.

Contribution

It provides the first comprehensive evaluation of frontier LLMs on a canonical graduate-level mathematics curriculum, highlighting their strengths and limitations in formal reasoning.

Findings

01

Top-tier models achieve approximately 66% accuracy in proof generation.

02

Models show significant variance in consistency and logical coherence.

03

Frontier models are promising for pedagogical use but need improvements for rigorous proof tasks.

Abstract

The rapid advancement of large language models (LLMs) has led to significant breakthroughs in automated mathematical reasoning and scientific discovery. Georgiev, Gómez-Serrano, Tao, and Wagner [GGSTW+25] demonstrate that AI systems can explore new constructions and improve existing bounds, illustrating the growing potential of LLMs to accelerate mathematical discovery. Similarly, Bubeck et al. [BCE+25] show that GPT-5 can meaningfully contribute to scientific workflows, from proposing hypotheses to generating proofs and analyses. Despite these advances, a rigorous evaluation of these models on canonical, graduate-level mathematical theory remains necessary to understand their baseline reasoning capabilities. In this paper, we present a comprehensive benchmark of four frontier models: GPT-5-Thinking, Gemini-3-Pro, Claude-Sonnet-4.5-Thinking, and Grok-4 against the classic…

Taxonomy

Topics: Machine Learning in Materials Science · Mathematics, Computing, and Information Processing · Scientific Computing and Data Management