Investigating Mysteries of CoT-Augmented Distillation
Somin Wadhwa, Silvio Amir, Byron C. Wallace

TL;DR
This paper investigates how chain-of-thought (CoT) rationales improve model distillation. It finds that rationales help most when placed after the label, that they need not be coherent, and that a few key tokens are enough to recover the performance gains.
Contribution
The paper shows that placing CoT sequences after labels (rather than before) improves distillation, and that only a few key tokens are needed, which challenges the assumption that rationales must be coherent to help.
Findings
Placing CoT after labels improves downstream performance.
Rationales do not need to be coherent to be effective.
A small number of key tokens can match full rationale performance.
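The first two findings concern how the student's fine-tuning target is assembled. The sketch below is illustrative only: the function names, the "Because:" and "So the answer is:" templates, and the whitespace tokenization are assumptions made for this example, not the authors' actual data format.

```python
import random


def build_target(label: str, rationale: str, cot_position: str = "post") -> str:
    """Assemble a student fine-tuning target from a label and a teacher rationale.

    The string templates here are hypothetical; they only illustrate the
    difference between placing the rationale before or after the label.
    """
    if cot_position == "pre":
        # Rationale first: the student must generate it before the label.
        return f"{rationale} So the answer is: {label}"
    if cot_position == "post":
        # Label first: generation can stop after the label at test time.
        return f"{label} Because: {rationale}"
    raise ValueError(f"unknown cot_position: {cot_position!r}")


def shuffle_rationale(rationale: str, seed: int = 0) -> str:
    """Return the rationale with its whitespace-separated tokens shuffled.

    This produces an incoherent rationale containing the same tokens,
    one simple way to probe whether coherence matters.
    """
    tokens = rationale.split()
    random.Random(seed).shuffle(tokens)
    return " ".join(tokens)


if __name__ == "__main__":
    rationale = "Birds have wings and most birds can fly."
    print(build_target("yes", rationale, cot_position="post"))
    print(build_target("yes", shuffle_rationale(rationale), cot_position="post"))
```

With the "post" layout, decoding can stop as soon as the label is produced, which matches the observation that no student "reasoning" is needed at test time.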
Abstract
Eliciting "chain of thought" (CoT) rationales -- sequences of tokens that convey a "reasoning" process -- has been shown to consistently improve LLM performance on tasks like question answering. More recent efforts have shown that such rationales can also be used for model distillation: Including CoT sequences (elicited from a large "teacher" model) in addition to target labels when fine-tuning a small student model yields (often substantial) improvements. In this work we ask: Why and how does this additional training signal help in model distillation? We perform ablations to interrogate this, and report some potentially surprising results. Specifically: (1) Placing CoT sequences after labels (rather than before) realizes consistently better downstream performance -- this means that no student "reasoning" is necessary at test time to realize gains. (2) When rationales are appended in…
Taxonomy
Topics: Process Optimization and Integration
