JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models

Kun Zhou; Beichen Zhang; Jiapeng Wang; Zhipeng Chen; Wayne Xin Zhao; Jing Sha; Zhichao Sheng; Shijin Wang; Ji-Rong Wen

arXiv:2405.14365·cs.CL·May 24, 2024·1 cites

Open Access · 1 Repo · 6 Models

TL;DR

JiuZhang3.0 introduces an efficient method to train a small language model for math problem synthesis by distilling GPT-4's capabilities, enabling high-quality data generation with reduced costs and achieving state-of-the-art reasoning performance.

Contribution

The paper presents a cost-effective approach to train a small LLM for math problem synthesis using knowledge distillation from GPT-4 and data selection techniques.

Findings

1. The trained small synthesis model generates 6 million high-quality math problems for pre-training JiuZhang3.0.

2. JiuZhang3.0 achieves state-of-the-art results on multiple mathematical reasoning benchmarks.

3. The method significantly reduces reliance on large-scale math corpora and on costly problem synthesis with stronger LLMs.

Abstract

Mathematical reasoning is an important capability of large language models (LLMs) for real-world applications. To enhance this capability, existing work either collects large-scale math-related texts for pre-training, or relies on stronger LLMs (e.g., GPT-4) to synthesize massive math problems. Both types of work generally lead to large costs in training or synthesis. To reduce the cost, based on open-source available texts, we propose an efficient way that trains a small LLM for math problem synthesis, to efficiently generate sufficient high-quality pre-training data. To achieve it, we create a dataset using GPT-4 to distill its data synthesis capability into the small LLM. Concretely, we craft a set of prompts based on human education stages to guide GPT-4, to synthesize problems covering diverse math knowledge and difficulty levels. Besides, we adopt the gradient-based influence…

Code & Models

Repositories

rucaibox/jiuzhang3.0 (Official)

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Parallel Computing and Optimization Techniques

Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Position-Wise Feed-Forward Layer · Multi-Head Attention · Residual Connection · Byte Pair Encoding · Label Smoothing · Adam · Absolute Position Encodings