BiLD: Bi-directional Logits Difference Loss for Large Language Model Distillation
Minchong Li, Feng Zhou, Xiaohui Song

TL;DR
This paper introduces BiLD, a loss function for large language model distillation that filters long-tail noise by using only the top-k logits and exploits the internal ranking information in the logits, outperforming existing distillation methods on 13 datasets.
Contribution
The paper proposes the BiLD loss, which effectively utilizes top-k logits and internal ranking information to enhance LLM distillation performance.
Findings
BiLD outperforms existing distillation methods on 13 datasets.
Filtering long-tail noise improves distillation quality.
Using internal ranking information enhances logits utilization.
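The two ideas listed above (keeping only the top-k logits and using their internal ranking) can be illustrated with a small sketch. This summary does not give the loss formula, so everything below is an assumption made for illustration: the pairwise-difference construction, the KL direction (teacher as target), the temperature parameter, and all function names are hypothetical and should not be read as the authors' implementation. The code works on a single token position with plain Python lists and needs k of at least 2.

```python
import math
from itertools import combinations


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(logits, k):
    """Indices of the k largest logits, highest first (drops the long tail)."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def pairwise_differences(values):
    """All differences values[i] - values[j] for i < j.

    With values sorted by rank, these differences encode the internal
    ranking: how far each candidate sits above the lower-ranked ones.
    """
    return [a - b for a, b in combinations(values, 2)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def logits_difference_kl(teacher, student, indices, temperature=1.0):
    """KL between softmaxed pairwise differences at the chosen positions.

    Assumption: the teacher distribution is always the target.
    """
    t_diff = pairwise_differences([teacher[i] for i in indices])
    s_diff = pairwise_differences([student[i] for i in indices])
    return kl_divergence(softmax(t_diff, temperature), softmax(s_diff, temperature))


def bild_loss_sketch(teacher, student, k=3, temperature=1.0):
    """Hypothetical bi-directional loss: one term led by the teacher's
    top-k positions, one led by the student's top-k positions."""
    teacher_led = logits_difference_kl(
        teacher, student, top_k_indices(teacher, k), temperature
    )
    student_led = logits_difference_kl(
        teacher, student, top_k_indices(student, k), temperature
    )
    return teacher_led + student_led


if __name__ == "__main__":
    teacher_logits = [5.0, 3.0, 1.0, -10.0, -12.0]
    student_logits = [4.0, 3.5, 0.0, -9.0, -20.0]
    print(bild_loss_sketch(teacher_logits, student_logits, k=3))
```

In this sketch, changing logits that fall outside both top-k sets leaves the loss unchanged, which is one way to picture what "filtering long-tail noise" means in practice.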
Abstract
In recent years, large language models (LLMs) have shown exceptional capabilities across various natural language processing (NLP) tasks. However, such impressive performance often comes with the trade-off of an increased parameter size, posing significant challenges for widespread deployment. Knowledge distillation (KD) provides a solution by transferring knowledge from a large teacher model to a smaller student model. In this paper, we explore the task-specific distillation of LLMs at the logit level. Our investigation reveals that the logits of fine-tuned LLMs exhibit a more extreme long-tail distribution than those from vision models, with hidden "noise" in the long tail affecting distillation performance. Furthermore, existing logits distillation methods often struggle to effectively utilize the internal ranking information from the logits. To address these, we propose the…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Knowledge Distillation
