Iterative Layer-wise Distillation for Efficient Compression of Large Language Models
Grigory Kovalev, Mikhail Tikhomirov

TL;DR
This paper presents an iterative layer-wise distillation method for compressing large language models. On Qwen2.5-3B it reduces the model from 36 to 28 layers with a 9.7% quality loss, which makes deployment in resource-constrained environments more practical.
Contribution
The paper introduces an improved iterative distillation technique based on layer importance evaluation, achieving substantial model compression with minimal performance loss.
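The iterative importance evaluation can be sketched as a greedy loop: at each step, every remaining layer is removed in turn, the resulting model is scored, and the layer whose removal hurts the score least is dropped. This is a minimal illustration, not the authors' implementation. The names `iterative_layer_pruning`, `evaluate`, and `recover` are hypothetical, and the toy scoring function stands in for benchmark evaluation on representative datasets.

```python
from typing import Callable, List, Optional, Sequence


def iterative_layer_pruning(
    layers: Sequence[int],
    evaluate: Callable[[Sequence[int]], float],
    target_num_layers: int,
    recover: Optional[Callable[[Sequence[int]], None]] = None,
) -> List[int]:
    """Greedily drop the layer whose removal degrades the score least.

    `evaluate` (hypothetical) returns a quality score, higher is better,
    for a model built from the given layer indices. `recover` (hypothetical)
    is an optional hook for further training after each removal.
    """
    kept = list(layers)
    while len(kept) > target_num_layers:
        # Score every candidate model that has exactly one layer removed.
        candidates = [
            (evaluate(kept[:i] + kept[i + 1:]), i) for i in range(len(kept))
        ]
        # The highest remaining score marks the least important layer.
        _, drop_idx = max(candidates)
        kept.pop(drop_idx)
        if recover is not None:
            recover(kept)  # e.g. a distillation step on the pruned model
    return kept


# Toy stand-in for benchmark evaluation: each layer contributes a fixed
# (made-up) amount to the score, with the middle layers contributing least.
toy_importance = [5.0, 4.0, 1.0, 0.5, 4.5, 5.5]


def toy_evaluate(kept: Sequence[int]) -> float:
    return sum(toy_importance[layer] for layer in kept)


pruned = iterative_layer_pruning(range(6), toy_evaluate, target_num_layers=4)
```

Because importance is re-evaluated after every removal, the ranking can change as the model shrinks, which is the difference from a one-shot ranking of layers.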
Findings
Reducing Qwen2.5-3B from 36 to 28 layers (2.47 billion parameters) costs only a 9.7% quality loss
Reducing further to 24 layers costs an 18% quality loss
Middle transformer layers are less critical for inference
Abstract
This work investigates distillation methods for large language models (LLMs) with the goal of developing compact models that preserve high performance. Several existing approaches are reviewed, with a discussion of their respective strengths and limitations. An improved method based on the ShortGPT approach has been developed, building upon the idea of incorporating iterative evaluation of layer importance. At each step, importance is assessed by measuring performance degradation when individual layers are removed, using a set of representative datasets. This process is combined with further training using a joint loss function based on KL divergence and mean squared error. Experiments on the Qwen2.5-3B model show that the number of layers can be reduced from 36 to 28 (resulting in a 2.47 billion parameter model) with only a 9.7% quality loss, and to 24 layers with an 18% loss. The…
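A joint loss of the kind described in the abstract might look like the sketch below. The abstract does not say which quantities each term compares or how the terms are weighted, so this example assumes KL divergence between teacher and student output distributions, mean squared error between their hidden states, and a hypothetical mixing weight `alpha`. It is written in plain Python for readability; a real training loop would use a tensor library.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def joint_distillation_loss(
    teacher_logits: Sequence[float],
    student_logits: Sequence[float],
    teacher_hidden: Sequence[float],
    student_hidden: Sequence[float],
    alpha: float = 0.5,  # assumed weighting, not taken from the paper
) -> float:
    """Weighted sum of KL on output distributions and MSE on hidden states."""
    kl = kl_divergence(softmax(teacher_logits), softmax(student_logits))
    return alpha * kl + (1.0 - alpha) * mse(teacher_hidden, student_hidden)


# The loss vanishes when the student matches the teacher exactly.
same = joint_distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1],
                               [0.3, -0.2], [0.3, -0.2])
# Any mismatch in logits or hidden states makes it positive.
diff = joint_distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0],
                               [0.3, -0.2], [0.0, 0.0])
```

The KL term pushes the pruned model toward the original model's predictions, while the MSE term keeps its internal representations close, under the assumptions stated above.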
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
