Dynamic Self-Distillation via Previous Mini-batches for Fine-tuning Small Language Models
Yao Fu, Yin Yu, Xiaotian Han, Runchao Li, Xianxuan Long, Haotian Yu, Pan Li

TL;DR
This paper introduces DynSDPB, a model-agnostic and task-agnostic self-distillation method for fine-tuning small language models. Each training iteration distills from the model's own outputs on the previous mini-batch, so no separate teacher model is required.
Contribution
The paper proposes DynSDPB, a novel, adaptable self-distillation approach for small language models that does not depend on architectural modifications or external teachers.
Findings
Outperforms existing self-distillation methods on various benchmarks
Effective for both encoder-only and decoder-only language models
Enhances fine-tuning adaptability and accuracy
Abstract
Knowledge distillation (KD) has become a widely adopted approach for compressing large language models (LLMs) to reduce computational costs and memory footprints. However, the availability of complex teacher models is a prerequisite for running most KD pipelines. Thus, the traditional KD procedure can be unachievable or budget-unfriendly, particularly when relying on commercial LLMs like GPT4. In this regard, Self-distillation (SelfD) emerges as an advisable alternative, enabling student models to learn without teachers' guidance. Nonetheless, existing SelfD approaches for LMs often involve architectural modifications, assuming the models are open-source, which may not always be practical. In this work, we introduce a model-agnostic and task-agnostic method named dynamic SelfD from the previous minibatch (DynSDPB), which realizes current iterations' distillation from the last ones'…
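The idea of distilling the current iteration from the previous one can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract above is truncated, so the loss form (cross-entropy plus a temperature-scaled KL term), the fixed weights `alpha` and `tau`, and the function names `softmax` and `self_distill_loss` are all assumptions made for illustration. The sketch also assumes that logits for the same sample were stored during the previous iteration.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def self_distill_loss(logits, prev_logits, label, alpha=0.5, tau=2.0):
    """Hypothetical per-sample loss: cross-entropy on the gold label plus
    a KL term pulling the current prediction toward the soft targets the
    same model produced for this sample in the previous iteration.
    alpha and tau are fixed here (an assumption of this sketch)."""
    probs = softmax(logits)
    ce = -math.log(probs[label])
    teacher = softmax(prev_logits, tau)  # previous iteration acts as teacher
    student = softmax(logits, tau)
    kl = sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)
    return ce + alpha * tau * tau * kl
```

When the previous and current logits agree, the KL term vanishes and the loss reduces to plain cross-entropy; any disagreement adds a non-negative penalty scaled by `alpha`.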
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Linear Warmup With Linear Decay · Layer Normalization · Adam · Residual Connection · Weight Decay · Softmax · Multi-Head Attention
