Riemann-Bench: A Benchmark for Moonshot Mathematics

Suhaas Garre; Erik Knutsen; Sushant Mehta; Edwin Chen

arXiv:2604.06802·cs.AI·April 9, 2026

TL;DR

Riemann-Bench is a private, expert-curated benchmark of 25 research-level math problems designed to evaluate AI systems' deep mathematical reasoning beyond olympiad-level skills.

Contribution

The paper introduces Riemann-Bench, a private benchmark of expert-verified, research-level problems for assessing the mathematical reasoning capabilities of AI systems.

Findings

01

Frontier models score below 10% on the benchmark.

02

A large gap remains between current models and human research-level mathematical reasoning.

03

The benchmark is private to prevent memorization and ensure genuine evaluation.

Abstract

Recent AI systems have achieved gold-medal-level performance on the International Mathematical Olympiad, demonstrating remarkable proficiency at competition-style problem solving. However, competition mathematics represents only a narrow slice of mathematical reasoning: problems are drawn from limited domains, require minimal advanced machinery, and can often reward insightful tricks over deep theoretical knowledge. We introduce Riemann-Bench, a private benchmark of 25 expert-curated problems designed to evaluate AI systems on research-level mathematics that goes far beyond the olympiad frontier. Problems are authored by Ivy League mathematics professors, graduate students, and PhD-holding IMO medalists, and routinely took their authors weeks to solve independently. Each problem undergoes double-blind verification by two independent domain experts who must solve the problem from scratch, and…
