Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt

TL;DR
This paper introduces MATH, a dataset of 12,500 challenging competition mathematics problems with step-by-step solutions, for evaluating and improving the mathematical reasoning of machine learning models. Accuracy remains low even for very large Transformer models.
Contribution
The paper presents the MATH dataset and a large auxiliary pretraining dataset that teaches models the fundamentals of mathematics. Together they provide a benchmark for mathematical problem solving and evidence that scaling alone is unlikely to yield strong mathematical reasoning.
Findings
Accuracy remains low even with large Transformer models.
Scaling models alone is insufficient for advanced mathematical reasoning.
New algorithmic approaches are needed beyond scaling Transformers.
Abstract
Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue.…
Code & Models
- 🤗 google/gemma-3-4b-it · model · 1.5M dl · ♡ 1272
- 🤗 google/gemma-3-27b-it · model · 1.0M dl · ♡ 1940
- 🤗 unsloth/gemma-3-12b-it-GGUF · model · 101k dl · ♡ 178
- 🤗 google/gemma-3-1b-it · model · 1.4M dl · ♡ 899
- 🤗 google/gemma-3-12b-it-qat-q4_0-gguf · model · 7.1k dl · ♡ 262
- 🤗 google/gemma-3-270m · model · 83k dl · ♡ 1003
- 🤗 google/gemma-2-2b-it · model · 368k dl · ♡ 1314
- 🤗 google/gemma-3-12b-it · model · 2.6M dl · ♡ 698
- 🤗 google/gemma-3-12b-it-qat-q4_0-unquantized · model · 28k dl · ♡ 81
- 🤗 p-e-w/gemma-3-12b-it-heretic · model · 2.4k dl · ♡ 79
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Attention Is All You Need · Dropout · Residual Connection · Adam · Byte Pair Encoding
