MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code
Zimu Lu, Aojun Zhou, Ke Wang, Houxing Ren, Weikang Shi, Junting Pan, Mingjie Zhan, Hongsheng Li

TL;DR
MathCoder2 continues pretraining large language models on mathematical code paired with the corresponding reasoning steps, with the aim of improving their mathematical reasoning abilities.
Contribution
The paper presents a new dataset and training method that incorporate code with reasoning steps, improving mathematical reasoning in language models.
Findings
Training with MathCode-Pile improves model mathematical abilities
Generated code accurately captures mathematical reasoning processes
Open-sourced data and code ensure reproducibility
Abstract
Code has been shown to be effective in enhancing the mathematical reasoning abilities of large language models due to its precision and accuracy. Previous works involving continued mathematical pretraining often include code that utilizes math-related packages, which are primarily designed for fields such as engineering, machine learning, signal processing, or module testing, rather than being directly focused on mathematical reasoning. In this paper, we introduce a novel method for generating mathematical code accompanied with corresponding reasoning steps for continued pretraining. Our approach begins with the construction of a high-quality mathematical continued pretraining dataset by incorporating math-related web data, code using mathematical packages, math textbooks, and synthetic data. Next, we construct reasoning steps by extracting LaTeX expressions, the conditions needed for…
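To make the idea of "code accompanied with corresponding reasoning steps" concrete, here is a minimal sketch of what one such paired sample could look like. The abstract is truncated above, so everything in it is an illustrative assumption rather than the authors' actual pipeline or data schema: the `ReasoningStep` fields, the convention that the code stores its answer in `result`, and the layout produced by `to_training_text` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReasoningStep:
    """One reasoning step paired with code (hypothetical schema)."""
    conditions: str  # conditions under which the expression applies
    latex: str       # the LaTeX expression taken from the source text
    code: str        # Python code mirroring the expression; assigns `result`


def run_step(step: ReasoningStep):
    """Execute the step's code in a fresh namespace and return `result`."""
    namespace = {}
    exec(step.code, namespace)  # only safe for trusted, illustrative snippets
    return namespace["result"]


def to_training_text(step: ReasoningStep) -> str:
    """Join the reasoning step and its code into one text sample (assumed layout)."""
    return (
        f"Conditions: {step.conditions}\n"
        f"Expression: ${step.latex}$\n"
        f"Code:\n{step.code}\n"
    )


step = ReasoningStep(
    conditions="a = 3 and b = 4 are the legs of a right triangle",
    latex=r"c = \sqrt{a^2 + b^2}",
    code="import math\na, b = 3, 4\nresult = math.sqrt(a**2 + b**2)",
)

print(to_training_text(step))
print(run_step(step))
```

Running the code alongside the text gives a cheap sanity check that the snippet really computes what the expression states, which is one plausible way to keep only consistent pairs.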
Code & Models
- 🤗 MathGenie/MathCoder2-Llama-3-8B (model) · 25 dl · ♡ 7
- 🤗 MathGenie/MathCoder2-Mistral-7B (model) · 12 dl · ♡ 3
- 🤗 MathGenie/MathCoder2-CodeLlama-7B (model) · 19 dl · ♡ 5
- 🤗 MathGenie/MathCoder2-DeepSeekMath-7B (model) · 17 dl · ♡ 6
- 🤗 MathGenie/fastText-cc-en-filter_round1 (model)
- 🤗 MathGenie/fastText-cc-en-filter_round2 (model)
- 🤗 QuantFactory/MathCoder2-Llama-3-8B-GGUF (model) · 122 dl · ♡ 3
- 🤗 QuantFactory/MathCoder2-CodeLlama-7B-GGUF (model) · 603 dl · ♡ 2
- 🤗 RichardErkhov/MathGenie_-_MathCoder2-CodeLlama-7B-4bits (model)
- 🤗 RichardErkhov/MathGenie_-_MathCoder2-CodeLlama-7B-8bits (model) · 1 dl
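For readers who want to try one of the listed checkpoints, the following is a minimal sketch using the Hugging Face `transformers` library. It assumes `transformers`, `torch`, and `accelerate` are installed and that the repository loads with the standard causal-LM classes. The "Question/Answer" prompt format in `build_prompt` is an assumption for illustration, not a format documented on this page.

```python
MODEL_ID = "MathGenie/MathCoder2-Llama-3-8B"  # repository name from the list above


def build_prompt(question: str) -> str:
    """Build a plain completion-style prompt (assumed format, not from the paper)."""
    return f"Question: {question.strip()}\nAnswer:"


def generate_answer(question: str, max_new_tokens: int = 256) -> str:
    """Download the checkpoint and generate a continuation for the question."""
    # Imported lazily so build_prompt can be used without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype="auto",  # use the dtype stored in the checkpoint
        device_map="auto",   # requires the `accelerate` package
    )
    inputs = tokenizer(build_prompt(question), return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


if __name__ == "__main__":
    print(generate_answer("What is the sum of the first 10 positive integers?"))
```

The GGUF and 4-bit/8-bit repositories in the list are quantized variants and generally need a different loader than the one sketched here.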
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning
Methods: Balanced Selection
