Iterative Layer-wise Distillation for Efficient Compression of Large Language Models

Grigory Kovalev; Mikhail Tikhomirov

arXiv:2511.05085 · cs.CL · November 10, 2025

TL;DR

This paper presents an iterative layer-wise distillation method for compressing large language models. On Qwen2.5-3B, it cuts the layer count from 36 to 28 with a 9.7% quality loss, which makes the model easier to deploy in resource-constrained environments.

Contribution

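The paper introduces an improved distillation technique, built on the ShortGPT approach, that iteratively re-evaluates layer importance by measuring the performance drop when individual layers are removed. Layer removal is combined with further training under a joint loss based on KL divergence and mean squared error.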
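
The iterative importance evaluation can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the function name `iterative_layer_pruning`, the `evaluate` and `finetune` callbacks, the choice to drop exactly one layer per iteration, and the toy importance scores are all assumptions made for the example.

```python
from typing import Callable, List, Sequence


def iterative_layer_pruning(
    layers: Sequence[int],
    evaluate: Callable[[Sequence[int]], float],
    target_num_layers: int,
    finetune: Callable[[Sequence[int]], None] = lambda kept: None,
) -> List[int]:
    """Greedy loop: repeatedly drop the layer whose removal hurts the score least.

    `evaluate` is a hypothetical callback returning a quality score (higher is
    better) for a model that keeps only the given layers. `finetune` is a
    hypothetical placeholder for the recovery training step.
    """
    kept = list(layers)
    while len(kept) > target_num_layers:
        # Score every candidate obtained by removing one remaining layer.
        candidates = []
        for layer in kept:
            remaining = [l for l in kept if l != layer]
            candidates.append((evaluate(remaining), layer))
        # The highest score after removal means the smallest degradation.
        _, least_important = max(candidates)
        kept.remove(least_important)
        # Importance is re-evaluated on the next pass, after recovery training.
        finetune(kept)
    return kept


# Toy stand-in for benchmark evaluation: each layer contributes a fixed amount.
# These numbers are invented for illustration only.
toy_importance = {0: 5.0, 1: 3.0, 2: 0.5, 3: 0.2, 4: 2.0, 5: 4.0}


def toy_evaluate(kept: Sequence[int]) -> float:
    return sum(toy_importance[l] for l in kept)


pruned = iterative_layer_pruning(range(6), toy_evaluate, target_num_layers=4)
print(pruned)  # [0, 1, 4, 5]
```

In a real setting, `evaluate` would run the pruned model on the representative datasets and `finetune` would perform the distillation step. The point of the loop is that importance is recomputed after every removal instead of being ranked once up front.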

Findings

01 Qwen2.5-3B reduced from 36 to 28 layers with only a 9.7% quality loss

02 Further reduction to 24 layers results in an 18% quality loss

03 Middle transformer layers are less critical for inference

Abstract

This work investigates distillation methods for large language models (LLMs) with the goal of developing compact models that preserve high performance. Several existing approaches are reviewed, with a discussion of their respective strengths and limitations. An improved method based on the ShortGPT approach has been developed, building upon the idea of incorporating iterative evaluation of layer importance. At each step, importance is assessed by measuring performance degradation when individual layers are removed, using a set of representative datasets. This process is combined with further training using a joint loss function based on KL divergence and mean squared error. Experiments on the Qwen2.5-3B model show that the number of layers can be reduced from 36 to 28 (resulting in a 2.47 billion parameter model) with only a 9.7% quality loss, and to 24 layers with an 18% loss. The…
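
A joint loss of the kind described above might look like the following sketch. The abstract only states that the loss is based on KL divergence and mean squared error, so everything else here is an assumption: the KL direction (teacher to student), applying MSE to hidden states, the weights `alpha` and `beta`, and all function names.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_logits: Sequence[float], student_logits: Sequence[float]) -> float:
    # KL(teacher || student) over the output distribution (assumed direction).
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mse(teacher_hidden: Sequence[float], student_hidden: Sequence[float]) -> float:
    # Mean squared error between hidden-state vectors (assumed target of the MSE term).
    diffs = [(t - s) ** 2 for t, s in zip(teacher_hidden, student_hidden)]
    return sum(diffs) / len(diffs)


def joint_loss(
    teacher_logits: Sequence[float],
    student_logits: Sequence[float],
    teacher_hidden: Sequence[float],
    student_hidden: Sequence[float],
    alpha: float = 1.0,  # hypothetical weight on the KL term
    beta: float = 1.0,   # hypothetical weight on the MSE term
) -> float:
    return alpha * kl_divergence(teacher_logits, student_logits) + beta * mse(
        teacher_hidden, student_hidden
    )


# Identical teacher and student outputs give zero loss; any mismatch is penalized.
same = joint_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [0.5, -0.5], [0.5, -0.5])
diff = joint_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [0.5, -0.5], [0.0, 0.0])
print(same, diff)
```

The KL term pulls the student's output distribution toward the teacher's, while the MSE term pulls intermediate representations together. How the paper balances or applies the two terms is not stated in the visible part of the abstract.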

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling