Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs
Jasper Dekoninck, Nikola Jovanović, Tim Gehrunger, Kári Rögnvaldsson, Ivo Petrov, Chenhao Sun, Martin Vechev

TL;DR
MathArena is an expanded, continuously maintained evaluation platform for mathematical reasoning with LLMs. It covers tasks ranging from olympiad problems to formal proofs, with the goal of tracking progress reliably over time.
Contribution
This work extends MathArena from static benchmarks to a dynamic platform, broadening its scope and establishing protocols for ongoing evaluation of LLMs in mathematics.
Findings
GPT-5.5 achieves 98% on 2026 USA Math Olympiad problems.
GPT-5.5 reaches 74% on research-level questions.
MathArena now includes proof competitions, arXiv problems, and formal proof generation.
Abstract
Large language models (LLMs) are becoming increasingly capable mathematical collaborators, but static benchmarks are no longer sufficient for evaluating progress: they are often narrow in scope, quickly saturated, and rarely updated. This makes it hard to compare models reliably and track progress over time. Instead, we need evaluation platforms: continuously maintained systems that run, aggregate, and analyze evaluations across many benchmarks to give a comprehensive picture of model performance within a broad domain. In this work, we build on the original MathArena benchmark by substantially broadening its scope from final-answer olympiad problems to a continuously maintained evaluation platform for mathematical reasoning with LLMs. MathArena now covers a much wider range of tasks, including proof-based competitions, research-level arXiv problems, and formal proof generation in Lean.…
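The idea of aggregating evaluations across many benchmarks can be illustrated with a minimal sketch. This is not MathArena's actual code or data: the record layout, the function names `aggregate` and `macro_average`, the model names, and the benchmark labels are all hypothetical, and the toy records are placeholders rather than real results.

```python
from collections import defaultdict


def aggregate(records):
    """Turn (model, benchmark, correct) records into {model: {benchmark: accuracy}}.

    The record layout is an assumption made for this illustration.
    """
    # counts[model][benchmark] = [number correct, number attempted]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for model, benchmark, correct in records:
        cell = counts[model][benchmark]
        cell[0] += int(correct)
        cell[1] += 1
    return {
        model: {bench: c / n for bench, (c, n) in per_bench.items()}
        for model, per_bench in counts.items()
    }


def macro_average(per_benchmark):
    """Unweighted mean over benchmarks, so each benchmark counts equally."""
    return sum(per_benchmark.values()) / len(per_benchmark)


# Placeholder records with hypothetical model and benchmark names.
records = [
    ("model_a", "final_answer", True),
    ("model_a", "final_answer", False),
    ("model_a", "proof", True),
    ("model_b", "final_answer", True),
    ("model_b", "proof", False),
]

table = aggregate(records)
for model, per_benchmark in table.items():
    print(model, per_benchmark, macro_average(per_benchmark))
```

A macro average is only one possible summary. A platform could equally report per-benchmark scores side by side, which avoids hiding a weak category behind a strong one.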
Code & Models
- MathArena/aime_2026 (dataset, 15k downloads)
- MathArena/aime_2025_I_outputs (dataset, 52 downloads)
- MathArena/aime_2025_II_outputs (dataset, 66 downloads)
- MathArena/hmmt_feb_2025_outputs (dataset, 149 downloads)
- MathArena/usamo_2025_outputs (dataset, 168 downloads)
- MathArena/usamo_2025 (dataset, 186 downloads)
- MathArena/aime_2025_I (dataset, 438 downloads)
- MathArena/aime_2025_II (dataset, 365 downloads)
