JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models
Kun Zhou, Beichen Zhang, Jiapeng Wang, Zhipeng Chen, Wayne Xin Zhao, Jing Sha, Zhichao Sheng, Shijin Wang, Ji-Rong Wen

TL;DR
JiuZhang3.0 introduces an efficient method to train a small language model for math problem synthesis by distilling GPT-4's capabilities, enabling high-quality data generation with reduced costs and achieving state-of-the-art reasoning performance.
Contribution
The paper presents a cost-effective approach to train a small LLM for math problem synthesis using knowledge distillation from GPT-4 and data selection techniques.
Findings
JiuZhang3.0 generates 6 million high-quality math problems for pre-training.
It achieves state-of-the-art results on multiple mathematical reasoning benchmarks.
The method significantly reduces reliance on large-scale data and expensive model training.
Abstract
Mathematical reasoning is an important capability of large language models (LLMs) for real-world applications. To enhance this capability, existing work either collects large-scale math-related texts for pre-training, or relies on stronger LLMs (e.g., GPT-4) to synthesize massive math problems. Both types of work generally lead to large costs in training or synthesis. To reduce the cost, based on open-source available texts, we propose an efficient way that trains a small LLM for math problem synthesis, to efficiently generate sufficient high-quality pre-training data. To achieve it, we create a dataset using GPT-4 to distill its data synthesis capability into the small LLM. Concretely, we craft a set of prompts based on human education stages to guide GPT-4, to synthesize problems covering diverse math knowledge and difficulty levels. Besides, we adopt the gradient-based influence…
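As a rough illustration of the prompt-crafting step described in the abstract, the sketch below pairs each source text with a set of education stages to produce synthesis prompts. The stage names, the template wording, and the helper `build_requests` are all hypothetical and are not taken from the paper; the sketch only builds prompt strings and does not call GPT-4 or any other model.

```python
from dataclasses import dataclass

# Hypothetical stage names; the paper's actual prompt set is not reproduced here.
EDUCATION_STAGES = ["grade school", "middle school", "high school", "college"]

# Hypothetical template wording, for illustration only.
PROMPT_TEMPLATE = (
    "You are a math teacher for {stage} students.\n"
    "Using the reference text below as inspiration, write one math problem "
    "at the {stage} level together with a step-by-step solution.\n\n"
    "Reference text:\n{text}\n"
)


@dataclass
class SynthesisRequest:
    stage: str
    prompt: str


def build_requests(texts, stages=EDUCATION_STAGES):
    """Pair every source text with every education stage."""
    return [
        SynthesisRequest(stage=s, prompt=PROMPT_TEMPLATE.format(stage=s, text=t))
        for t in texts
        for s in stages
    ]


requests = build_requests(["The area of a circle with radius r is pi * r**2."])
```

Each request could then be sent to a teacher model, with the returned problem and solution stored as distillation data for the smaller synthesis model.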
Code & Models
- 🤗 ToheartZhang/JiuZhang3.0-7B · model · 3 dl · ♡ 1
- 🤗 ToheartZhang/JiuZhang3.0-Synthesis-7B · model · 3 dl · ♡ 1
- 🤗 ToheartZhang/JiuZhang3.0-8x7B · model
- 🤗 ToheartZhang/JiuZhang3.0-8B · model · 3 dl · ♡ 1
- 🤗 RichardErkhov/ToheartZhang_-_JiuZhang3.0-Synthesis-7B-gguf · model · 515 dl
- 🤗 RichardErkhov/ToheartZhang_-_JiuZhang3.0-7B-gguf · model · 6 dl
Taxonomy
Topics: Mathematics, Computing, and Information Processing · Parallel Computing and Optimization Techniques
Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Position-Wise Feed-Forward Layer · Multi-Head Attention · Residual Connection · Byte Pair Encoding · Label Smoothing · Adam · Absolute Position Encodings
