MathDuels: Evaluating LLMs as Problem Posers and Solvers
Zhiqiu Xu, Shibo Jin, Shreya Arya, Mayur Naik

TL;DR
MathDuels is a dynamic self-play benchmark in which language models both author and solve math problems, exposing capability differences that static, solver-only benchmarks cannot distinguish.
Contribution
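The paper presents MathDuels, a self-play benchmark that combines a three-stage problem generation pipeline (meta-prompting, problem generation, and difficulty amplification), an independent verifier that excludes ill-posed problems, and a Rasch model that jointly estimates solver ability and problem difficulty.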
Findings
Authoring and solving abilities are partially decoupled.
Dual-role evaluation uncovers capability differences hidden in single-role tests.
Benchmark difficulty co-evolves with model strength, preventing saturation.
Abstract
As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a self-play benchmark in which models occupy dual roles: each authors math problems under adversarial prompting and solves problems authored by every other participant. Problems are produced through a three-stage generation pipeline (meta-prompting, problem generation, and difficulty amplification), and validated by an independent verifier that excludes ill-posed questions. A Rasch model (Rasch, 1993) jointly estimates solver abilities and problem difficulties; author quality is derived from the difficulties of each model's authored problems. Experiments across 19 frontier models reveal that authoring and solving…
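The Rasch model mentioned in the abstract can be illustrated with a small sketch. In its standard form, the probability that solver i answers problem j correctly is 1 / (1 + exp(-(theta_i - b_j))), where theta_i is the solver's ability and b_j is the problem's difficulty. The code below is not the authors' implementation. It assumes a plain gradient-ascent fit with a small ridge penalty (added here only to keep estimates finite when a solver gets everything right), and it assumes author quality is the mean difficulty of a model's authored problems, since the abstract does not state the exact aggregation. The function names, hyperparameters, and toy data are all hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function used by the Rasch model."""
    return 1.0 / (1.0 + math.exp(-x))


def fit_rasch(responses, n_solvers, n_problems, lr=0.05, steps=2000, l2=0.1):
    """Jointly estimate solver abilities and problem difficulties.

    responses: list of (solver, problem, correct) with correct in {0, 1}.
    The l2 ridge penalty is an assumption of this sketch, not from the paper.
    """
    theta = [0.0] * n_solvers   # solver abilities
    b = [0.0] * n_problems      # problem difficulties
    for _ in range(steps):
        # Gradient of the penalized log-likelihood.
        g_theta = [-l2 * t for t in theta]
        g_b = [-l2 * d for d in b]
        for s, p, y in responses:
            resid = y - sigmoid(theta[s] - b[p])
            g_theta[s] += resid   # d/d theta_s
            g_b[p] -= resid       # d/d b_p
        theta = [t + lr * g for t, g in zip(theta, g_theta)]
        b = [d + lr * g for d, g in zip(b, g_b)]
    # Only theta - b matters, so center difficulties at zero by convention.
    shift = sum(b) / len(b)
    return [t - shift for t in theta], [d - shift for d in b]


def author_quality(b, problem_author, n_authors):
    """Mean difficulty of each author's problems (assumed aggregation)."""
    totals = [0.0] * n_authors
    counts = [0] * n_authors
    for p, a in enumerate(problem_author):
        totals[a] += b[p]
        counts[a] += 1
    return [t / c if c else float("nan") for t, c in zip(totals, counts)]


# Toy duel: 3 models, each authors 2 problems and solves the other 4.
problem_author = [0, 0, 1, 1, 2, 2]
responses = [
    (0, 2, 1), (0, 3, 1), (0, 4, 1), (0, 5, 1),  # model 0 solves everything
    (1, 0, 0), (1, 1, 1), (1, 4, 1), (1, 5, 1),
    (2, 0, 0), (2, 1, 0), (2, 2, 1), (2, 3, 0),
]
theta, b = fit_rasch(responses, n_solvers=3, n_problems=6)
quality = author_quality(b, problem_author, n_authors=3)
```

In this toy data, model 0 both solves every problem it attempts and writes the problems that the other models fail most often, so it ends up with a higher ability and a higher author score than model 2. Keeping the two scores separate is what lets a dual-role evaluation show cases where authoring and solving strength diverge.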
