TinyGSM: achieving >80% on GSM8k with small language models
Bingbin Liu, Sebastien Bubeck, Ronen Eldan, Janardhan Kulkarni, Yuanzhi Li, Anh Nguyen, Rachel Ward, Yi Zhang

TL;DR
This paper demonstrates that small language models can achieve over 80% accuracy on GSM8K math problems by using a high-quality synthetic dataset and a verification process, challenging the notion that larger models are necessary.
Contribution
The paper introduces TinyGSM, a synthetic dataset of 12.3 million grade school math problems paired with Python solutions generated by GPT-3.5, and shows that small models fine-tuned on this data can outperform much larger models on GSM8K.
Findings
A 1.3B-parameter generation model paired with a 1.3B-parameter verifier reaches 81.5% accuracy on GSM8K after fine-tuning on TinyGSM.
High-quality synthetic datasets can enable small models to excel in mathematical reasoning.
A verifier component improves final output accuracy by selecting the best candidate.
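The verifier finding above can be pictured as best-of-n selection: generate several candidate solutions, score each one, and keep the highest-scoring candidate. The sketch below is a minimal illustration under assumptions, not the paper's implementation. The function name `select_best`, the candidate strings, and the scores are all made up for the example; a real verifier would be a trained model rather than a lookup table.

```python
from typing import Callable, Sequence


def select_best(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest verifier score (first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


# Hypothetical verifier scores, invented for illustration only.
toy_scores = {"answer = 18": 0.91, "answer = 20": 0.12, "answer = 16": 0.40}

# The lambda stands in for a learned verifier that maps a candidate to a score.
best = select_best(list(toy_scores), lambda candidate: toy_scores[candidate])
```

Here `best` is the candidate the stand-in verifier rates highest. Swapping the lambda for a call into a real scoring model would leave `select_best` unchanged.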
Abstract
Small-scale models offer various computational advantages, and yet to which extent size is critical for problem-solving abilities remains an open question. Specifically for solving grade school math, the smallest model size so far required to break the 80% barrier on the GSM8K benchmark remains to be 34B. Our work studies how high-quality datasets may be the key for small language models to acquire mathematical reasoning. We introduce TinyGSM, a synthetic dataset of 12.3M grade school math problems paired with Python solutions, generated fully by GPT-3.5. After finetuning on TinyGSM, we find that a duo of a 1.3B generation model and a 1.3B verifier model can achieve 81.5% accuracy, outperforming existing models that are orders of magnitude larger. This also rivals the performance of the GPT-3.5 "teacher" model (77.4%), from which our model's training data is…
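To make "math problems paired with Python solutions" concrete, here is a rough sketch of how such a pair could be checked by running the solution and comparing its return value with a reference answer. Everything in it is an assumption for illustration: the problem text, the `solution` entry-point convention, and the helper `run_solution` are invented here and are not taken from TinyGSM.

```python
def run_solution(source: str, entry_point: str = "solution"):
    """Execute solution source code and return the value of its entry-point function."""
    namespace: dict = {}
    # exec is only acceptable for trusted code; model-generated code needs a sandbox.
    exec(source, namespace)
    return namespace[entry_point]()


# Made-up example pair, written for illustration; not a TinyGSM record.
problem = (
    "A box holds 12 pencils. Maria buys 3 boxes and gives away 5 pencils. "
    "How many pencils does she have left?"
)
solution_source = '''
def solution():
    pencils_per_box = 12
    boxes = 3
    given_away = 5
    return pencils_per_box * boxes - given_away
'''

answer = run_solution(solution_source)
is_correct = answer == 31  # reference answer: 12 * 3 - 5
```

Expressing the solution as code means the final answer comes from execution rather than from the model's own arithmetic, which makes checking it against a reference value mechanical.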
Taxonomy
Topics: Text Readability and Simplification · Educational Assessment and Pedagogy · Machine Learning and Data Classification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Byte Pair Encoding · Layer Normalization · Softmax · Residual Connection
