TL;DR
This paper introduces Socratic CoT, a knowledge distillation method that enables small language models to acquire reasoning skills comparable to large models by decomposing problems into subproblems.
Contribution
It presents a novel distillation approach using Socratic CoT to transfer reasoning capabilities from large models to smaller ones, improving their performance significantly.
Findings
Distilled small models improve on their baselines by over 70% on reasoning tasks.
Socratic CoT enables smaller models to sometimes surpass larger models in reasoning.
Distilled models perform well across multiple reasoning datasets.
Abstract
Step-by-step reasoning approaches like chain of thought (CoT) have proved to be very effective in inducing reasoning capabilities in large language models. However, the success of the CoT approach is fundamentally tied to the model size, and billion-parameter-scale models are often needed to get CoT to work. In this paper, we propose a knowledge distillation approach that leverages the step-by-step CoT reasoning capabilities of larger models and distills these abilities into smaller models. We also propose an alternative reasoning scheme, Socratic CoT, that learns a decomposition of the original problem into a sequence of subproblems and uses it to guide the intermediate reasoning steps. We use Socratic CoT to train a combination of two small distilled models: a problem decomposer and a subproblem solver. In practice, given a new problem, the two distilled models work in sync…
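The two-model setup described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the function names (`socratic_answer`, `toy_decompose`, `toy_solve`), the prompt layout, and the choice to treat the last sub-answer as the final answer are all assumptions made for illustration, and the two "models" are hard-coded stand-ins rather than trained networks.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: a decomposer maps a problem to subquestions,
# a solver maps (context so far, subquestion) to an answer string.
Decomposer = Callable[[str], List[str]]
Solver = Callable[[str, str], str]


def socratic_answer(problem: str, decompose: Decomposer, solve: Solver) -> Tuple[str, str]:
    """Answer subquestions in order, feeding earlier Q/A pairs back as context."""
    context = problem
    answer = ""
    for subquestion in decompose(problem):
        answer = solve(context, subquestion)
        context += f"\nQ: {subquestion}\nA: {answer}"
    # Assumption of this sketch: the last sub-answer is the final answer.
    return answer, context


# Toy stand-ins for the two distilled models (hard-coded, for illustration only).
def toy_decompose(problem: str) -> List[str]:
    return [
        "How many apples does Ann have after buying 2 more?",
        "How many apples are left after she eats 1?",
    ]


def toy_solve(context: str, subquestion: str) -> str:
    lookup = {
        "How many apples does Ann have after buying 2 more?": "5",
        "How many apples are left after she eats 1?": "4",
    }
    return lookup[subquestion]


problem = "Ann has 3 apples, buys 2 more, then eats 1. How many are left?"
final, trace = socratic_answer(problem, toy_decompose, toy_solve)
print(final)
print(trace)
```

In a real system each stand-in would be replaced by a call to a small fine-tuned language model; the loop only shows how a decomposer and a solver could hand work to each other.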
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Discriminative Fine-Tuning · Cosine Annealing · Linear Warmup With Cosine Annealing · Softmax · Attention Dropout · Dropout · GPT-2
