Specializing Smaller Language Models towards Multi-Step Reasoning

Yao Fu; Hao Peng; Litu Ou; Ashish Sabharwal; Tushar Khot

arXiv:2301.12726 · cs.CL · January 31, 2023 · 43 cites

TL;DR

This paper demonstrates that small language models can be specialized for multi-step reasoning by distilling that ability from larger models, concentrating their limited capacity on a specific target task.

Contribution

The paper introduces model specialization for small models, showing how distillation from large models, together with careful design choices, improves multi-step reasoning.

Findings

01. Specialized small models outperform general small models in multi-step reasoning.

02. There is a tradeoff between general ability and task-specific performance.

03. Effective distillation can transfer complex reasoning skills to smaller models.

Abstract

The surprising ability of Large Language Models (LLMs) to perform well on complex reasoning with only few-shot chain-of-thought prompts is believed to emerge only in very large-scale models (100+ billion parameters). We show that such abilities can, in fact, be distilled down from GPT-3.5 ($\geq$ 175B) to T5 variants ($\leq$ 11B). We propose model specialization, which concentrates a model's ability on a target task. The hypothesis is that large models (commonly viewed as larger than 100B) have strong modeling power, but that power is spread across a large spectrum of tasks. Small models (commonly viewed as smaller than 10B) have limited capacity, but if that capacity is concentrated on a specific target task, the model can achieve a decent performance improvement. We use multi-step math reasoning as our testbed because it is a very typical emergent ability. We show two important aspects of model…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms

Methods: Attention Is All You Need · Inverse Square Root Schedule · Linear Layer · Dropout · Softmax · Cosine Annealing · Attention Dropout · SentencePiece