Diversity-Aware Reverse Kullback-Leibler Divergence for Large Language Model Distillation
Hoang-Chau Luong, Dat Ba Tran, and Lingwei Chen

TL;DR
This paper introduces Diversity-aware RKL (DRKL), a distillation objective for large language models that improves output diversity and tail-class alignment relative to standard RKL.
Contribution
The paper analyzes the limitations of RKL through a gradient decomposition and proposes DRKL, which improves output diversity and tail-class alignment in LLM distillation.
Findings
DRKL outperforms FKL, RKL, and other distillation objectives across the evaluated datasets.
DRKL achieves a better fidelity-diversity trade-off.
Extensive experiments validate DRKL's effectiveness across model families.
Abstract
Reverse Kullback-Leibler (RKL) divergence has recently emerged as the preferred objective for large language model (LLM) distillation, consistently outperforming forward KL (FKL), particularly in regimes with large vocabularies and significant teacher-student capacity mismatch, where RKL focuses learning on dominant modes rather than enforcing dense alignment. However, RKL introduces a structural limitation that drives the student toward overconfident predictions. We first provide an analysis of RKL by decomposing its gradients into target and non-target components, and show that non-target gradients consistently push the target logit upward even when the student already matches the teacher, thereby reducing output diversity. In addition, RKL provides weak supervision over non-target classes, leading to poor tail alignment. To address these issues, we propose Diversity-aware RKL (DRKL),…
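The difference between the two baseline objectives named in the abstract can be made concrete with a small numerical sketch. The code below computes FKL and RKL for a single token position and the gradient of each with respect to the student logits, using the standard closed forms for a softmax student. It is an illustration under assumptions: the function names and the toy logits are hypothetical, and it does not reproduce the paper's target/non-target decomposition or DRKL itself, neither of which is specified in this excerpt.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(p, q):
    """FKL: KL(p || q), teacher p weighted. Penalizes q missing teacher mass."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)


def reverse_kl(p, q):
    """RKL: KL(q || p), student q weighted. Penalizes q placing mass where p is small."""
    return sum(qi * (math.log(qi) - math.log(pi)) for pi, qi in zip(p, q) if qi > 0)


def forward_kl_grad(p, student_logits):
    """d FKL / d z_j = q_j - p_j for a softmax student q = softmax(z)."""
    q = softmax(student_logits)
    return [qi - pi for pi, qi in zip(p, q)]


def reverse_kl_grad(p, student_logits):
    """d RKL / d z_j = q_j * ((log q_j - log p_j) - RKL) for q = softmax(z)."""
    q = softmax(student_logits)
    rkl = reverse_kl(p, q)
    return [qi * ((math.log(qi) - math.log(pi)) - rkl) for pi, qi in zip(p, q)]


# Toy example (hypothetical values): a 4-token vocabulary with a peaked student.
teacher_probs = softmax([2.0, 1.0, 0.0, -1.0])
student_logits = [3.0, 0.0, 0.0, 0.0]
student_probs = softmax(student_logits)

print("FKL:", forward_kl(teacher_probs, student_probs))
print("RKL:", reverse_kl(teacher_probs, student_probs))
print("FKL grad:", forward_kl_grad(teacher_probs, student_logits))
print("RKL grad:", reverse_kl_grad(teacher_probs, student_logits))
```

Each RKL gradient entry is scaled by the student probability q_j, so tokens to which the student assigns little mass receive a correspondingly small update. This is one way to see the weak supervision over non-target classes that the abstract describes.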
