TL;DR
This paper presents a novel Chain-of-Thought (CoT) distillation method that transfers the teacher's stepwise attention on key information to a small student model and uses a Mixture of Layers module for dynamic layer alignment, achieving consistent improvements across multiple reasoning datasets.
Contribution
It introduces a new CoT distillation framework that combines stepwise attention transfer with dynamic layer alignment, described as a first for enhancing reasoning in small models.
Findings
Achieves consistent performance improvements across multiple reasoning datasets.
Leverages stepwise attention shifts to guide small models during reasoning.
Introduces a Mixture of Layers module for dynamic layer alignment.
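This summary does not say how the Mixture of Layers module works internally. One plausible reading of "dynamic layer alignment" is a softmax gate that blends several candidate layers into a single alignment target. The sketch below illustrates only that generic idea and is not the paper's implementation; `softmax`, `mix_layers`, `layer_maps`, and `gate_scores` are hypothetical names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_maps, gate_scores):
    """Blend per-layer attention vectors with softmax gate weights.

    layer_maps:  one attention vector per candidate layer (equal lengths).
    gate_scores: one unnormalized score per layer (assumed learnable;
                 here they are plain numbers for illustration).
    """
    if len(layer_maps) != len(gate_scores):
        raise ValueError("need exactly one gate score per layer")
    weights = softmax(gate_scores)
    width = len(layer_maps[0])
    return [
        sum(w * layer[i] for w, layer in zip(weights, layer_maps))
        for i in range(width)
    ]
```

With equal gate scores the result is the plain average of the layers; as one score grows, the mixture approaches that single layer, which is how a gate of this kind could pick an alignment target dynamically.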
Abstract
The significant computational demands of large language models have increased interest in distilling reasoning abilities into smaller models via Chain-of-Thought (CoT) distillation. Current CoT distillation methods mainly focus on transferring teacher-generated rationales for complex reasoning to student models. However, they do not adequately explore teachers' dynamic attention toward critical information during reasoning. We find that language models exhibit progressive attention shifts towards key information during reasoning, which implies essential clues for drawing conclusions. Building on this observation and analysis, we introduce a novel CoT distillation framework that transfers the teacher's stepwise attention on key information to the student model. This establishes structured guidance for the student's progressive concentration on key information during reasoning.
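The abstract does not give the loss used for attention transfer. As a rough illustration, assume that at each reasoning step the teacher and student each assign attention weights to the input tokens, and that the student is penalized by the KL divergence between the two distributions restricted to key tokens. The sketch below is a minimal version under those assumptions, not the authors' formulation; `stepwise_attention_loss`, `key_positions`, and the helper functions are hypothetical names.

```python
import math


def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    return [w / total for w in weights]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def stepwise_attention_loss(teacher_steps, student_steps, key_positions):
    """Mean per-step KL between teacher and student attention on key tokens.

    teacher_steps, student_steps: one attention row over the input tokens
        per reasoning step (assumed to be aligned step by step).
    key_positions: indices of the tokens treated as key information
        (assumed to be given; how they are chosen is not covered here).
    """
    if len(teacher_steps) != len(student_steps):
        raise ValueError("teacher and student need the same number of steps")
    losses = []
    for t_row, s_row in zip(teacher_steps, student_steps):
        t_key = normalize([t_row[i] for i in key_positions])
        s_key = normalize([s_row[i] for i in key_positions])
        losses.append(kl_divergence(t_key, s_key))
    return sum(losses) / len(losses)
```

The loss is zero when the student's attention over the key tokens matches the teacher's at every step and grows as the two diverge, which is one way to encode the "structured guidance" the abstract describes.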