Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
Boyu Shi, YiCheng Jiang, Chang Liu, Qiufeng Wang, Xu Yang, Xin Geng

TL;DR
The paper introduces Chain-based Distillation (CBD), a scalable method for initializing variable-sized language models. It builds a knowledge-transfer chain of intermediate anchor models, improving both efficiency and downstream performance.
Contribution
CBD is a novel scalable distillation paradigm that enables efficient initialization of variable-sized language models through a chain of intermediate models and parameter interpolation.
Findings
CBD outperforms scratch training on a 10B-token corpus for a 138M-parameter model.
CBD improves efficiency and downstream performance of small language models.
CBD demonstrates versatility across different architectures and vocabularies.
Abstract
Large language models (LLMs) achieve strong performance but remain costly to deploy in resource-constrained settings. Training small language models (SLMs) from scratch is computationally expensive, while conventional knowledge distillation requires repeated access to large teachers for different target sizes, leading to poor scalability. To solve these problems, we propose Chain-based Distillation (CBD), a scalable paradigm for efficiently initializing variable-sized language models. A sparse and limited sequence of intermediate models (called anchors) is constructed via stepwise distillation, forming a distillation chain that progressively transfers knowledge from the source LLMs. To support heterogeneous settings, we introduce bridge distillation for cross-architecture and cross-vocabulary transfer. Models of variable sizes are initialized via parameter interpolation…
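The stepwise distillation described in the abstract can be illustrated with a minimal sketch. The abstract does not state the loss CBD uses, so the code below substitutes the standard temperature-softened KL objective from conventional knowledge distillation. The helper names (`kd_loss`, `chain_links`) and the anchor sizes in the usage note are hypothetical, chosen only for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Temperature-softened KL(teacher || student), scaled by T^2.

    Assumption: this is the generic distillation objective, used here as a
    stand-in; the loss actually used by CBD is not given in the abstract.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


def chain_links(sizes):
    """Pair each anchor with the next smaller one, largest first.

    Each (teacher_size, student_size) pair is one distillation step, so the
    source LLM is only needed for the first link of the chain.
    """
    ordered = sorted(sizes, reverse=True)
    return list(zip(ordered[:-1], ordered[1:]))
```

For example, `chain_links([7000, 1000, 300])` (hypothetical sizes in millions of parameters) yields the two steps 7000 to 1000 and 1000 to 300, and `kd_loss` would be averaged over token positions within each step. How models between anchors are then obtained by parameter interpolation is not specified in the text above, so it is not sketched here.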
