A Survey on Knowledge Distillation of Large Language Models
Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, Tianyi Zhou

TL;DR
This survey comprehensively reviews knowledge distillation techniques for large language models, emphasizing their role in model compression, skill enhancement, and the interplay with data augmentation to improve open-source LLMs.
Contribution
It provides a detailed overview of KD mechanisms, highlights the integration of data augmentation, and discusses future research directions in the context of LLMs.
Findings
KD enables open-source models to approximate proprietary LLM capabilities
Data augmentation significantly enhances KD effectiveness
The survey offers a structured framework for KD in LLMs
Abstract
In the era of Large Language Models (LLMs), Knowledge Distillation (KD) emerges as a pivotal methodology for transferring advanced capabilities from leading proprietary LLMs, such as GPT-4, to their open-source counterparts like LLaMA and Mistral. Additionally, as open-source LLMs flourish, KD plays a crucial role in both compressing these models and facilitating their self-improvement by employing themselves as teachers. This paper presents a comprehensive survey of KD's role within the realm of LLMs, highlighting its critical function in imparting advanced knowledge to smaller models and its utility in model compression and self-improvement. Our survey is meticulously structured around three foundational pillars: algorithm, skill, and verticalization, providing a comprehensive examination of KD mechanisms, the enhancement of specific cognitive abilities,…
Taxonomy
Topics: Topic Modeling
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Label Smoothing · Adam · Softmax · Multi-Head Attention · Layer Normalization · Residual Connection · Absolute Position Encodings
