Skip-Thinking: Chunk-wise Chain-of-Thought Distillation Enable Smaller Language Models to Reason Better and Faster

Xiao Chen; Sihang Zhou; Ke Liang; Xiaoyu Sun; Xinwang Liu

arXiv:2505.18642 · cs.CL · May 27, 2025


TL;DR

This paper introduces chunk-wise training and skip-thinking distillation to improve small language models' reasoning speed and accuracy by focusing on core reasoning chunks and skipping non-essential parts during training and inference.

Contribution

The paper proposes chunk-wise training and skip-thinking distillation, enabling smaller models to reason faster and more accurately by isolating and emphasizing core reasoning segments.

Findings

01 Enhanced reasoning speed in small models

02 Maintained or improved reasoning accuracy

03 Effective across multiple tasks and models

Abstract

Chain-of-thought (CoT) distillation allows a large language model (LLM) to guide a small language model (SLM) in reasoning tasks. Existing methods train the SLM to learn the entire long rationale in one iteration, which causes two issues: 1) Long rationales lead to a large token-level batch size during training, so the gradients of core reasoning tokens (i.e., tokens that directly affect the correctness of subsequent reasoning) are over-smoothed, because these tokens make up only a tiny fraction of the rationale. As a result, the SLM converges to sharp minima where it fails to grasp the reasoning logic. 2) The response is slow, as the SLM must generate a long rationale before reaching the answer. Therefore, we propose chunk-wise training (CWT), which uses a heuristic search to divide the rationale into internally semantically coherent chunks and focuses the SLM on learning from only one chunk per iteration. In this…
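
The dilution argument in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the paper's heuristic search for chunk boundaries is not reproduced here, so the boundaries below are hand-picked, and the example assumes a token loss averaged over all supervised tokens. The function names, the toy rationale, and the choice of "core" token are hypothetical.

```python
from typing import List


def split_into_chunks(tokens: List[str], boundaries: List[int]) -> List[List[str]]:
    """Split a rationale at the given token indices.

    `boundaries` stands in for the output of a chunk search; here it is
    supplied by hand (assumption).
    """
    edges = [0, *boundaries, len(tokens)]
    return [tokens[a:b] for a, b in zip(edges, edges[1:])]


def chunk_loss_masks(num_tokens: int, boundaries: List[int]) -> List[List[int]]:
    """Build one 0/1 loss mask per chunk.

    In a given iteration only the tokens of one chunk are supervised,
    which is one way to read "learning from only one chunk per iteration".
    """
    edges = [0, *boundaries, num_tokens]
    return [
        [1 if a <= i < b else 0 for i in range(num_tokens)]
        for a, b in zip(edges, edges[1:])
    ]


def core_token_weight(mask: List[int], core_index: int) -> float:
    """Share of a mean-reduced token loss carried by a single token."""
    supervised = sum(mask)
    return mask[core_index] / supervised if supervised else 0.0


# Toy rationale (hypothetical), tokenized on whitespace.
rationale = "Tom has 3 apples . He buys 2 more . 3 + 2 = 5 . The answer is 5 .".split()
boundaries = [10, 16]   # hand-picked chunk boundaries (assumption)
core_index = 14         # the computed "5", treated here as a core reasoning token

chunks = split_into_chunks(rationale, boundaries)
masks = chunk_loss_masks(len(rationale), boundaries)
full_mask = [1] * len(rationale)

# Weight of the core token when the whole rationale is supervised at once,
# versus when only its own chunk is supervised.
weight_full = core_token_weight(full_mask, core_index)
weight_chunk = core_token_weight(masks[1], core_index)
print(f"full rationale: {weight_full:.4f}, single chunk: {weight_chunk:.4f}")
```

Under these assumptions, the core token carries a larger share of the averaged loss when only its chunk is supervised, which is the intuition behind training on one chunk per iteration.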

