Understanding the Effect of Data Augmentation on Knowledge Distillation
Ziqi Wang, Chi Han, Wenxuan Bao, Heng Ji

TL;DR
This paper investigates how data augmentation affects knowledge distillation (KD). It finds that KD benefits from larger semantic shifts in augmented data than fine-tuning does, and that the best degree of augmentation varies with dataset size.
Contribution
The study shows that knowledge distillation prefers more semantic shift in augmented data than fine-tuning does, and offers insights into choosing the augmentation proportion based on dataset size.
Findings
KD prefers larger semantic shifts than fine-tuning.
Optimal augmentation proportion varies with dataset size.
Smaller datasets benefit from higher degrees of augmentation.
Abstract
Knowledge distillation (KD) requires sufficient data to transfer knowledge from large-scale teacher models to small-scale student models. Data augmentation has therefore been widely used to mitigate data shortages in specific scenarios. Classic data augmentation techniques, such as synonym replacement and k-nearest-neighbors, were initially designed for fine-tuning. To avoid severe semantic shifts and preserve task-specific labels, those methods change only a small proportion of tokens (e.g., changing 10% of tokens is generally the best option for fine-tuning). However, such data augmentation methods are sub-optimal for knowledge distillation, since the teacher model can provide label distributions and is more tolerant of semantic shifts. We first observe that KD prefers as much data as possible, unlike fine-tuning, where too much data will not gain more…
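The two ideas in the abstract, changing a chosen proportion of tokens and training the student on the teacher's label distribution, can be illustrated with a minimal Python sketch. This is not the authors' implementation. Random replacement from a small candidate list stands in for synonym replacement or k-nearest-neighbors, and the names `replace_tokens`, `soft_label_loss`, and `temperature` are hypothetical, chosen only for illustration.

```python
import math
import random


def replace_tokens(tokens, proportion, candidates, rng):
    """Replace about `proportion` of the tokens with random candidates.

    Assumption: random replacement is a stand-in for synonym or
    k-nearest-neighbor replacement, which need an external resource.
    """
    n_replace = round(len(tokens) * proportion)
    positions = rng.sample(range(len(tokens)), n_replace)
    augmented = list(tokens)
    for i in positions:
        augmented[i] = rng.choice(candidates)
    return augmented


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student's distribution against the teacher's.

    The teacher's full label distribution is the target, so an augmented
    sentence needs no hand-assigned hard label.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


rng = random.Random(0)
sentence = "the movie was surprisingly good and well acted".split()
candidates = ["film", "bad", "great", "plot", "slow"]  # hypothetical vocabulary

light = replace_tokens(sentence, 0.1, candidates, rng)  # small semantic shift
heavy = replace_tokens(sentence, 0.5, candidates, rng)  # larger semantic shift
print(light)
print(heavy)

# Hypothetical logits for a two-class task on the heavily augmented sentence.
print(soft_label_loss([0.3, 0.1], [1.2, -0.4]))
```

In fine-tuning, the heavily augmented sentence would keep its original hard label even if its meaning had drifted. In the distillation setting sketched here, the teacher relabels it with a distribution, which is why larger shifts are tolerable.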
Taxonomy
Topics: Machine Learning and Data Classification · Online Learning and Analytics · Intelligent Tutoring Systems and Adaptive Learning
Methods: Knowledge Distillation
