Distilling Reasoning Capabilities into Smaller Language Models

Kumar Shridhar; Alessandro Stolfo; Mrinmaya Sachan

arXiv:2212.00193·cs.LG·May 19, 2023

TL;DR

This paper introduces Socratic CoT, a knowledge distillation method that enables small language models to acquire reasoning skills comparable to large models by decomposing problems into subproblems.

Contribution

The paper presents a knowledge distillation approach, Socratic CoT, that transfers step-by-step reasoning capabilities from large models to smaller ones by training two small models: a problem decomposer and a subproblem solver.

Findings

01 Distilled small models improve over baselines by more than 70% on reasoning tasks.

02 In some cases, smaller models trained with Socratic CoT outperform larger models on reasoning tasks.

03 Distilled models perform well across multiple reasoning datasets.

Abstract

Step-by-step reasoning approaches like chain of thought (CoT) have proved to be very effective in inducing reasoning capabilities in large language models. However, the success of the CoT approach is fundamentally tied to the model size, and billion parameter-scale models are often needed to get CoT to work. In this paper, we propose a knowledge distillation approach that leverages the step-by-step CoT reasoning capabilities of larger models and distills these abilities into smaller models. In this work, we propose an alternative reasoning scheme, Socratic CoT, that learns a decomposition of the original problem into a sequence of subproblems and uses it to guide the intermediate reasoning steps. We use Socratic CoT to train a combination of two small distilled models: a problem decomposer and a subproblem solver. In practice, given a new problem, the two distilled models work in sync…
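
The decomposer/solver split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `socratic_solve`, `toy_decompose`, and `toy_solve` are hypothetical, the two "models" are hard-coded stand-ins rather than distilled language models, and the control flow (each subquestion is answered with earlier answers as context, and the last answer is taken as the final one) is an assumption made for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: in practice each callable would wrap a small
# distilled language model. These aliases are assumptions, not the paper's API.
Decomposer = Callable[[str], List[str]]
Solver = Callable[[str, List[Tuple[str, str]], str], str]


def socratic_solve(
    problem: str, decompose: Decomposer, solve: Solver
) -> Tuple[str, List[Tuple[str, str]]]:
    """Decompose a problem into subquestions and answer them in order.

    Assumption: each subquestion is answered given the problem text and all
    earlier (subquestion, answer) pairs, and the answer to the last
    subquestion is treated as the final answer.
    """
    history: List[Tuple[str, str]] = []
    for subquestion in decompose(problem):
        answer = solve(problem, history, subquestion)
        history.append((subquestion, answer))
    if not history:
        raise ValueError("decomposer returned no subquestions")
    return history[-1][1], history


# Toy stand-ins so the sketch runs without any model weights.
def toy_decompose(problem: str) -> List[str]:
    return [
        "How many apples does Ann have after buying more?",
        "How many apples does Ann have after giving some away?",
    ]


def toy_solve(problem: str, history: List[Tuple[str, str]], subquestion: str) -> str:
    if not history:
        return str(3 + 4)  # first step: 3 apples plus 4 bought
    return str(int(history[-1][1]) - 2)  # second step: give 2 away


final_answer, trace = socratic_solve(
    "Ann has 3 apples, buys 4 more, then gives 2 away. How many are left?",
    toy_decompose,
    toy_solve,
)
print(final_answer)
```

Keeping the decomposer and solver behind plain callables means either one could be swapped for a real model wrapper without changing the loop; how the two distilled models actually interact at inference time should be checked against the paper itself.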

Code & Models

Repositories

kumar-shridhar/distiiling-lm (Official)

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Discriminative Fine-Tuning · Cosine Annealing · Linear Warmup With Cosine Annealing · Softmax · Attention Dropout · Dropout · GPT-2