OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset
Shubham Toshniwal, Ivan Moshkov, Sean Narenthiran, Daria Gitman, Fei Jia, Igor Gitman

TL;DR
This paper introduces OpenMathInstruct-1, a large open-source math instruction dataset with 1.8 million problem-solution pairs, enabling open models to achieve competitive math reasoning performance.
Contribution
The paper presents a new large-scale open-source math dataset generated with open LLMs rather than closed-source ones, narrowing the performance gap with closed-source models.
Findings
OpenMath-CodeLlama-70B achieves 84.6% on GSM8K
The dataset enables open models to perform competitively in math reasoning
Code, models, and dataset are publicly released under a permissive license
Abstract
Recent work has shown the immense potential of synthetically generated datasets for training large language models (LLMs), especially for acquiring targeted skills. Current large-scale math instruction tuning datasets such as MetaMathQA (Yu et al., 2024) and MAmmoTH (Yue et al., 2024) are constructed using outputs from closed-source LLMs with commercially restrictive licenses. A key reason limiting the use of open-source LLMs in these data generation pipelines has been the wide gap between the mathematical skills of the best closed-source LLMs, such as GPT-4, and the best open-source LLMs. Building on the recent progress in open-source LLMs, our proposed prompting novelty, and some brute-force scaling, we construct OpenMathInstruct-1, a math instruction tuning dataset with 1.8M problem-solution pairs. The dataset is constructed by synthesizing code-interpreter solutions for GSM8K and…
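The idea of a code-interpreter solution can be illustrated with a minimal sketch: a candidate solution is a short program whose printed output is the answer. The filtering step shown here (keeping only candidates whose output matches a reference answer) is an assumption for illustration, since the abstract excerpt above is truncated before describing the pipeline. The function names `run_solution` and `keep_correct` and the toy problem are hypothetical and are not taken from the authors' code.

```python
import contextlib
import io


def run_solution(code: str) -> str:
    """Execute a candidate solution and return whatever it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        # Illustration only: untrusted generated code needs a real sandbox.
        exec(code, {})
    return buffer.getvalue().strip()


def keep_correct(candidates: list[str], reference: str) -> list[str]:
    """Keep candidates whose printed answer equals the reference answer.

    Assumed filtering rule, not confirmed by the excerpt above.
    """
    kept = []
    for code in candidates:
        try:
            answer = run_solution(code)
        except Exception:
            continue  # a candidate that fails to run is discarded
        if answer == reference:
            kept.append(code)
    return kept


# Hypothetical toy problem:
# "Ann has 3 bags of 4 apples and eats 2. How many are left?"
candidates = [
    "print(3 * 4 - 2)",  # correct reasoning
    "print(3 + 4 - 2)",  # wrong arithmetic
    "print(3 * 4 - )",   # does not parse
]
kept = keep_correct(candidates, reference="10")
```

Comparing answers as exact strings is the simplest possible check; a real pipeline would likely normalize numeric formats before comparing.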
Code & Models
- 🤗 nvidia/OpenMath-Mistral-7B-v0.1 · model · 1 dl · ♡ 12
- 🤗 nvidia/OpenMath-Mistral-7B-v0.1-hf · model · 146 dl · ♡ 35
- 🤗 nvidia/OpenMath-CodeLlama-7b-Python · model · 26 dl · ♡ 2
- 🤗 nvidia/OpenMath-CodeLlama-7b-Python-hf · model · 46 dl · ♡ 8
- 🤗 nvidia/OpenMath-CodeLlama-13b-Python · model · 18 dl · ♡ 1
- 🤗 nvidia/OpenMath-CodeLlama-13b-Python-hf · model · 42 dl · ♡ 1
- 🤗 nvidia/OpenMath-CodeLlama-34b-Python · model · 45 dl · ♡ 3
- 🤗 nvidia/OpenMath-CodeLlama-34b-Python-hf · model · 118 dl · ♡ 1
- 🤗 nvidia/OpenMath-Llama-2-70b · model · 1 dl · ♡ 4
- 🤗 nvidia/OpenMath-Llama-2-70b-hf · model · 168 dl · ♡ 3
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Mathematics, Computing, and Information Processing
Methods: Position-Wise Feed-Forward Layer · Attention Is All You Need · Dropout · Linear Layer · Dense Connections · Label Smoothing · Absolute Position Encodings · Softmax · Byte Pair Encoding · Multi-Head Attention
