Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

Jasper Dekoninck; Nikola Jovanović; Tim Gehrunger; Kári Rögnvaldsson; Ivo Petrov; Chenhao Sun; Martin Vechev

arXiv:2605.00674 · cs.CL · May 18, 2026

TL;DR

MathArena is an expanded, continuously maintained evaluation platform for mathematical reasoning with LLMs, covering diverse tasks from olympiad problems to formal proofs, enabling reliable progress tracking.

Contribution

This work extends MathArena from static benchmarks to a dynamic platform, broadening its scope and establishing protocols for ongoing evaluation of LLMs in mathematics.

Findings

01. GPT-5.5 achieves 98% on 2026 USA Math Olympiad problems.
02. GPT-5.5 reaches 74% on research-level questions.
03. MathArena now includes proof competitions, arXiv problems, and formal proof generation.

Abstract

Large language models (LLMs) are becoming increasingly capable mathematical collaborators, but static benchmarks are no longer sufficient for evaluating progress: they are often narrow in scope, quickly saturated, and rarely updated. This makes it hard to compare models reliably and track progress over time. Instead, we need evaluation platforms: continuously maintained systems that run, aggregate, and analyze evaluations across many benchmarks to give a comprehensive picture of model performance within a broad domain. In this work, we build on the original MathArena benchmark by substantially broadening its scope from final-answer olympiad problems to a continuously maintained evaluation platform for mathematical reasoning with LLMs. MathArena now covers a much wider range of tasks, including proof-based competitions, research-level arXiv problems, and formal proof generation in Lean.…
