A Survey on Knowledge Distillation of Large Language Models

Xiaohan Xu; Ming Li; Chongyang Tao; Tao Shen; Reynold Cheng; Jinyang Li; Can Xu; Dacheng Tao; Tianyi Zhou

arXiv:2402.13116 · cs.CL · October 22, 2024 · 51 cites

TL;DR

This survey comprehensively reviews knowledge distillation techniques for large language models, emphasizing their role in model compression, skill enhancement, and the interplay with data augmentation to improve open-source LLMs.

Contribution

It provides a detailed overview of KD mechanisms, highlights the integration of data augmentation, and discusses future research directions in the context of LLMs.

Findings

01. KD enables open-source models to approximate proprietary LLM capabilities.

02. Data augmentation significantly enhances KD effectiveness.

03. The survey offers a structured framework for KD in LLMs.

Abstract

In the era of Large Language Models (LLMs), Knowledge Distillation (KD) emerges as a pivotal methodology for transferring advanced capabilities from leading proprietary LLMs, such as GPT-4, to their open-source counterparts, such as LLaMA and Mistral. Additionally, as open-source LLMs flourish, KD plays a crucial role both in compressing these models and in facilitating their self-improvement by employing themselves as teachers. This paper presents a comprehensive survey of KD's role within the realm of LLMs, highlighting its critical function in imparting advanced knowledge to smaller models and its utility in model compression and self-improvement. Our survey is meticulously structured around three foundational pillars: algorithm, skill, and verticalization, providing a comprehensive examination of KD mechanisms, the enhancement of specific cognitive abilities,…

Code & Models

Repositories

tebmer/awesome-knowledge-distillation-of-llms (PyTorch, official)

Taxonomy

Topics: Topic Modeling

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Label Smoothing · Adam · Softmax · Multi-Head Attention · Layer Normalization · Residual Connection · Absolute Position Encodings