Data for Mathematical Copilots: Better Ways of Presenting Proofs for Machine Learning

Simon Frieder; Jonas Bayer; Sam Looi; Jacob Loader; Julius Berner; Katherine M. Collins; Andr\'as Juh\'asz; Fabian Ruehle; Sean Welleck; Gabriel Poesia; Ryan-Rhys Griffiths; Adrian Weller; Anirudh Goyal; Cameron Freer; Thomas Lukasiewicz; Timothy Gowers

arXiv:2412.15184·cs.LG·December 22, 2025

Data for Mathematical Copilots: Better Ways of Presenting Proofs for Machine Learning

Simon Frieder, Jonas Bayer, Sam Looi, Jacob Loader, Julius Berner, Katherine M. Collins, Andr\'as Juh\'asz, Fabian Ruehle, Sean Welleck, Gabriel Poesia, Ryan-Rhys Griffiths, Adrian Weller, Anirudh Goyal, Cameron Freer, Thomas Lukasiewicz, Timothy Gowers

PDF

Open Access

TL;DR

This paper critiques current mathematical datasets for AI copilots, highlighting their limitations and proposing richer, process-oriented benchmarks that include motivation and proof discovery to better evaluate and develop mathematical reasoning in AI models.

Contribution

It introduces the concept of 'motivated proof' and advocates for new datasets that focus on proof processes, not just final results, to improve AI mathematical capabilities.

Findings

01

Current benchmarks are limited and can mislead model development.

02

Proposed datasets include proof motivation and discovery processes.

03

Enhancing datasets can lead to better mathematical reasoning in AI models.

Abstract

The datasets and benchmarks commonly used to train and evaluate the mathematical capabilities of AI-based mathematical copilots (primarily large language models) exhibit several shortcomings and misdirections. These range from a restricted scope of mathematical complexity to limited fidelity in capturing aspects beyond the final, written proof (e.g. motivating the proof, or representing the thought processes leading to a proof). These issues are compounded by a dynamic reminiscent of Goodhart's law: as benchmark performance becomes the primary target for model development, the benchmarks themselves become less reliable indicators of genuine mathematical capability. We systematically explore these limitations and contend that enhancing the capabilities of large language models, or any forthcoming advancements in AI-based mathematical assistants (copilots or ``thought partners''),…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsNumerical Methods and Algorithms · Mathematics, Computing, and Information Processing · Polynomial and algebraic computation

MethodsAttentive Walk-Aggregating Graph Neural Network