OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset

Shubham Toshniwal; Ivan Moshkov; Sean Narenthiran; Daria Gitman; Fei Jia; Igor Gitman

arXiv:2402.10176 · cs.CL · November 5, 2024 · 3 cites

TL;DR

This paper introduces OpenMathInstruct-1, a large open-source math instruction dataset with 1.8 million problem-solution pairs, enabling open models to achieve competitive math reasoning performance.

Contribution

The paper presents a new large-scale open-source math dataset generated with open LLMs, narrowing the performance gap with closed-source models.

Findings

01. OpenMath-CodeLlama-70B achieves 84.6% on GSM8K

02. The dataset enables open models to perform competitively in math reasoning

03. Code, models, and dataset are publicly released under a permissive license

Abstract

Recent work has shown the immense potential of synthetically generated datasets for training large language models (LLMs), especially for acquiring targeted skills. Current large-scale math instruction tuning datasets such as MetaMathQA (Yu et al., 2024) and MAmmoTH (Yue et al., 2024) are constructed using outputs from closed-source LLMs with commercially restrictive licenses. A key reason limiting the use of open-source LLMs in these data generation pipelines has been the wide gap between the mathematical skills of the best closed-source LLMs, such as GPT-4, and the best open-source LLMs. Building on the recent progress in open-source LLMs, our proposed prompting novelty, and some brute-force scaling, we construct OpenMathInstruct-1, a math instruction tuning dataset with 1.8M problem-solution pairs. The dataset is constructed by synthesizing code-interpreter solutions for GSM8K and…

Code & Models

Repositories

kipok/nemo-skills (Official)

Videos

OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset · slideslive

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Mathematics, Computing, and Information Processing

Methods: Position-Wise Feed-Forward Layer · Attention Is All You Need · Dropout · Linear Layer · Dense Connections · Label Smoothing · Absolute Position Encodings · Softmax · Byte Pair Encoding · Multi-Head Attention