Improving Reasoning Capabilities in Small Models through Mixture-of-Layers Distillation with Stepwise Attention on Key Information

Yao Chen; Jiawei Sheng; Wenyuan Zhang; Tingwen Liu

arXiv:2604.15701 · cs.CL · April 20, 2026

TL;DR

This paper presents a Chain-of-Thought (CoT) distillation method that transfers the teacher's stepwise attention on key information to the student and uses a Mixture of Layers module for dynamic layer alignment, achieving consistent improvements across multiple reasoning datasets.

Contribution

It introduces a new CoT distillation framework that combines stepwise attention transfer with dynamic layer alignment to enhance reasoning in small models.

Findings

01. Achieves consistent performance improvements across multiple reasoning datasets.

02. Leverages stepwise attention shifts to guide small models during reasoning.

03. Introduces a Mixture of Layers module for dynamic layer alignment.

Abstract

The significant computational demands of large language models have increased interest in distilling reasoning abilities into smaller models via Chain-of-Thought (CoT) distillation. Current CoT distillation methods mainly focus on transferring teacher-generated rationales for complex reasoning to student models. However, they do not adequately explore teachers' dynamic attention toward critical information during reasoning. We find that language models exhibit progressive attention shifts towards key information during reasoning, which implies essential clues for drawing conclusions. Building on this observation and analysis, we introduce a novel CoT distillation framework that transfers the teacher's stepwise attention on key information to the student model. This establishes structured guidance for the student's progressive concentration on key information during reasoning.
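To make the idea in the abstract concrete, here is a minimal sketch of what a stepwise attention distillation objective with a soft mixture over teacher layers might look like. The abstract does not specify the loss function, how key tokens are identified, or how layers are mixed, so everything below is an assumption made for illustration: the function names, the softmax gate over teacher layers, and the KL divergence are hypothetical choices, not the authors' implementation.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mix_teacher_layers(teacher_attn, gate_logits):
    # teacher_attn[layer][step][token]: attention each reasoning step pays
    # to each input token, per teacher layer (each row sums to 1).
    # ASSUMPTION: layers are combined with softmax gate weights; the paper's
    # Mixture of Layers module may work differently.
    weights = softmax(gate_logits)
    n_steps = len(teacher_attn[0])
    n_tokens = len(teacher_attn[0][0])
    return [
        [sum(w * layer[s][t] for w, layer in zip(weights, teacher_attn))
         for t in range(n_tokens)]
        for s in range(n_steps)
    ]

def key_attention_mass(attn, key_positions):
    # Attention mass placed on the (hypothetical) key-token positions
    # at each reasoning step; useful for inspecting stepwise shifts.
    return [sum(row[t] for t in key_positions) for row in attn]

def stepwise_attention_loss(student_attn, teacher_attn, gate_logits, eps=1e-9):
    # ASSUMPTION: KL(teacher mixture || student), averaged over steps.
    target = mix_teacher_layers(teacher_attn, gate_logits)
    per_step = [
        sum(p * (math.log(p + eps) - math.log(q + eps))
            for p, q in zip(t_row, s_row))
        for t_row, s_row in zip(target, student_attn)
    ]
    return sum(per_step) / len(per_step)

# Toy data: 2 teacher layers, 2 reasoning steps, 3 input tokens.
teacher_attn = [
    [[0.7, 0.2, 0.1], [0.2, 0.3, 0.5]],
    [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]],
]
gate_logits = [0.0, 0.0]  # equal weight on both layers
target = mix_teacher_layers(teacher_attn, gate_logits)
key_mass = key_attention_mass(target, key_positions=[2])
uniform_student = [[1 / 3] * 3, [1 / 3] * 3]
loss = stepwise_attention_loss(uniform_student, teacher_attn, gate_logits)
```

In this toy setup, `key_mass` grows from the first step to the second, mimicking the progressive shift toward key information that the abstract describes, and `loss` falls to zero when the student's attention matches the mixed teacher target. In a real training setup the gate logits would presumably be learned and the loss added to the usual rationale-generation objective, but the abstract does not confirm those details.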
