Investigating Mysteries of CoT-Augmented Distillation

Somin Wadhwa; Silvio Amir; Byron C. Wallace

arXiv:2406.14511·cs.CL·October 1, 2024

TL;DR

This paper investigates how chain-of-thought (CoT) rationales improve model distillation. It finds that rationales help most when placed after the label, that they need not be coherent, and that a small number of key tokens suffice for the performance gains.

Contribution

The paper shows that placing CoT sequences after labels (rather than before) improves distillation, and that a few key tokens are enough to match full rationales, challenging the assumption that rationales must be coherent to help.

Findings

01 Placing CoT after labels improves downstream performance.

02 Rationales do not need to be coherent to be effective.

03 A small number of key tokens can match full rationale performance.

Abstract

Eliciting "chain of thought" (CoT) rationales -- sequences of tokens that convey a "reasoning" process -- has been shown to consistently improve LLM performance on tasks like question answering. More recent efforts have shown that such rationales can also be used for model distillation: Including CoT sequences (elicited from a large "teacher" model) in addition to target labels when fine-tuning a small student model yields (often substantial) improvements. In this work we ask: Why and how does this additional training signal help in model distillation? We perform ablations to interrogate this, and report some potentially surprising results. Specifically: (1) Placing CoT sequences after labels (rather than before) realizes consistently better downstream performance -- this means that no student "reasoning" is necessary at test time to realize gains. (2) When rationales are appended in…
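To make the placement ablation in (1) concrete, here is a minimal sketch of how a student's fine-tuning target might be assembled with the rationale either before or after the label. The function names, the `Answer:` / `Rationale:` markers, and the token-shuffling helper are all hypothetical illustrations, not the authors' actual data format or ablation code. The shuffle is only one possible way to produce an incoherent rationale.

```python
import random


def build_target(label: str, rationale: str, placement: str = "post") -> str:
    """Compose a student target sequence from a label and a teacher rationale.

    The "Answer:" and "Rationale:" markers are assumed for illustration only.
    """
    if placement == "pre":
        # Rationale first: the student must generate the CoT before the label.
        return f"{rationale} Answer: {label}"
    if placement == "post":
        # Label first: the rationale acts only as extra training signal.
        return f"Answer: {label} Rationale: {rationale}"
    raise ValueError(f"unknown placement: {placement!r}")


def shuffle_rationale(rationale: str, seed: int = 0) -> str:
    """Return the rationale with its whitespace-separated tokens shuffled.

    This is an assumed stand-in for an 'incoherent rationale' condition.
    """
    tokens = rationale.split()
    random.Random(seed).shuffle(tokens)
    return " ".join(tokens)


def extract_label(generated: str) -> str:
    """Read the label from generated text using the assumed markers."""
    after_answer = generated.split("Answer:", 1)[1]
    return after_answer.split("Rationale:", 1)[0].strip()


post_target = build_target("yes", "Birds have wings so most can fly.", "post")
pre_target = build_target("yes", "Birds have wings so most can fly.", "pre")
```

With the label-first layout, decoding can stop as soon as the label has been produced, which is consistent with the abstract's point that no student "reasoning" is needed at test time.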

Taxonomy

Topics: Process Optimization and Integration