Dynamic Self-Distillation via Previous Mini-batches for Fine-tuning Small Language Models

Yao Fu; Yin Yu; Xiaotian Han; Runchao Li; Xianxuan Long; Haotian Yu; Pan Li

arXiv:2411.16991·cs.CL·November 27, 2024

TL;DR

This paper introduces DynSDPB, a model-agnostic and task-agnostic self-distillation method for fine-tuning small language models. Each training iteration distills from the preceding one, so no separate teacher model is required.

Contribution

The paper proposes DynSDPB, a novel, adaptable self-distillation approach for small language models that does not depend on architectural modifications or external teachers.

Findings

1. Outperforms existing self-distillation methods on various benchmarks

2. Effective for both encoder-only and decoder-only language models

3. Enhances fine-tuning adaptability and accuracy

Abstract

Knowledge distillation (KD) has become a widely adopted approach for compressing large language models (LLMs) to reduce computational costs and memory footprints. However, the availability of complex teacher models is a prerequisite for running most KD pipelines. Thus, the traditional KD procedure can be unachievable or budget-unfriendly, particularly when relying on commercial LLMs like GPT4. In this regard, Self-distillation (SelfD) emerges as an advisable alternative, enabling student models to learn without teachers' guidance. Nonetheless, existing SelfD approaches for LMs often involve architectural modifications, assuming the models are open-source, which may not always be practical. In this work, we introduce a model-agnostic and task-agnostic method named dynamic SelfD from the previous minibatch (DynSDPB), which realizes current iterations' distillation from the last ones'…
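
The idea of distilling the current iteration from the preceding one can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract is truncated here, so everything below is an assumption made for illustration. The names `self_distill_loss`, `step`, and `prev_cache` are hypothetical, the fixed `alpha` and `temperature` stand in for whatever dynamic adjustment the paper actually uses, and the temperature-squared scaling is a common KD convention rather than something stated in the source.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of raw logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cross_entropy(probs, label, eps=1e-12):
    """Negative log-likelihood of the gold label."""
    return -math.log(probs[label] + eps)

def self_distill_loss(curr_logits, label, prev_logits=None,
                      alpha=0.5, temperature=2.0):
    """Hypothetical combined loss: CE plus KD against earlier logits.

    ASSUMPTION: alpha and temperature are fixed constants here; the
    paper's dynamic schedule is not reproduced.
    """
    ce = cross_entropy(softmax(curr_logits), label)
    if prev_logits is None:
        # No earlier prediction for this example yet: plain fine-tuning.
        return ce
    teacher = softmax(prev_logits, temperature)  # soft targets from the past
    student = softmax(curr_logits, temperature)
    kd = kl_divergence(teacher, student)
    return (1 - alpha) * ce + alpha * temperature ** 2 * kd

# Hypothetical cache: example id -> logits the model produced for that
# example in the preceding iteration.
prev_cache = {}

def step(example_id, curr_logits, label):
    """Compute the loss for one example, then store its logits for later."""
    loss = self_distill_loss(curr_logits, label, prev_cache.get(example_id))
    prev_cache[example_id] = list(curr_logits)
    return loss
```

In this sketch the model acts as its own teacher: the first time an example is seen the loss reduces to cross-entropy, and on a later iteration the cached logits supply soft targets. In a real training loop the cached logits would be treated as constants (no gradient flows through them).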

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Linear Warmup With Linear Decay · Layer Normalization · Adam · Residual Connection · Weight Decay · Softmax · Multi-Head Attention