TinyGSM: achieving >80% on GSM8k with small language models

Bingbin Liu; Sebastien Bubeck; Ronen Eldan; Janardhan Kulkarni; Yuanzhi Li; Anh Nguyen; Rachel Ward; Yi Zhang

arXiv:2312.09241·cs.LG·December 15, 2023·1 cites

TL;DR

This paper demonstrates that small language models can achieve over 80% accuracy on GSM8K math problems by using a high-quality synthetic dataset and a verification process, challenging the notion that larger models are necessary.

Contribution

Introducing TinyGSM, a synthetic dataset of 12.3 million math problems, and showing that small models fine-tuned on this data can outperform larger models on GSM8K.

Findings

01

A pair of 1.3B-parameter models (one generator, one verifier) reaches 81.5% accuracy on GSM8K after fine-tuning on TinyGSM.

02

High-quality synthetic datasets can enable small models to excel in mathematical reasoning.

03

A verifier component improves final output accuracy by selecting the best candidate.
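
The verifier step in finding 03 can be pictured as best-of-n selection: generate several candidate solutions, score each one, and keep the highest-scoring candidate. The following is a minimal sketch under stated assumptions, not the paper's implementation. The names `select_best`, `run_solution`, and `toy_scores` are hypothetical, the example problem is made up, and a fixed lookup table stands in for the trained 1.3B verifier model.

```python
from typing import Callable, List


def select_best(candidates: List[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest verifier score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def run_solution(code: str):
    """Execute candidate source and call its solution() function.

    Assumption: every candidate defines a zero-argument solution().
    exec is used for illustration only; untrusted code needs a sandbox.
    """
    namespace = {}
    exec(code, namespace)
    return namespace["solution"]()


# Hypothetical problem: 3 boxes hold 4 apples each; how many apples in total?
candidates = [
    "def solution():\n    return 3 + 4\n",  # wrong: adds instead of multiplying
    "def solution():\n    return 3 * 4\n",  # right
]

# Stand-in for a trained verifier: fixed, made-up scores per candidate.
toy_scores = {candidates[0]: 0.2, candidates[1]: 0.9}

best = select_best(candidates, toy_scores.__getitem__)
answer = run_solution(best)
print(answer)  # 12
```

In a real pipeline the `score` callable would wrap a model call, and the selected program's output would be compared against the reference answer.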

Abstract

Small-scale models offer various computational advantages, and yet to which extent size is critical for problem-solving abilities remains an open question. Specifically for solving grade school math, the smallest model size so far required to break the 80% barrier on the GSM8K benchmark remains to be 34B. Our work studies how high-quality datasets may be the key for small language models to acquire mathematical reasoning. We introduce TinyGSM, a synthetic dataset of 12.3M grade school math problems paired with Python solutions, generated fully by GPT-3.5. After finetuning on TinyGSM, we find that a duo of a 1.3B generation model and a 1.3B verifier model can achieve 81.5% accuracy, outperforming existing models that are orders of magnitude larger. This also rivals the performance of the GPT-3.5 "teacher" model (77.4%), from which our model's training data is…

Code & Models

Datasets

taesiri/arxiv_qa
dataset · 193 downloads

Taxonomy

Topics: Text Readability and Simplification · Educational Assessment and Pedagogy · Machine Learning and Data Classification

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Byte Pair Encoding · Layer Normalization · Softmax · Residual Connection