Revisiting Intermediate Layer Distillation for Compressing Language Models: An Overfitting Perspective
Jongwoo Ko, Seungjoon Park, Minchan Jeong, Sukjin Hong, Euijai Ahn, Du-Seong Chang, Se-Young Yun

TL;DR
This paper shows that intermediate layer distillation (ILD) for language models tends to overfit the training data, and proposes a consistency-regularized variant that builds on two observations: distilling only the last Transformer layer and running ILD on supplementary tasks both reduce that overfitting.
Contribution
It introduces a simple consistency-regularized ILD method (CR-ILD) that mitigates overfitting and makes knowledge distillation for language models more effective.
Findings
CR-ILD outperforms existing KD methods on the GLUE benchmark
Distilling only the last Transformer layer improves generalization
Using supplementary tasks reduces overfitting in ILD
Abstract
Knowledge distillation (KD) is a highly promising method for mitigating the computational problems of pre-trained language models (PLMs). Among various KD approaches, Intermediate Layer Distillation (ILD) has been a de facto standard KD method with its performance efficacy in the NLP field. In this paper, we find that existing ILD methods are prone to overfitting to training datasets, although these methods transfer more information than the original KD. Next, we present the simple observations to mitigate the overfitting of ILD: distilling only the last Transformer layer and conducting ILD on supplementary tasks. Based on our two findings, we propose a simple yet effective consistency-regularized ILD (CR-ILD), which prevents the student model from overfitting the training dataset. Substantial experiments on distilling BERT on the GLUE benchmark and several synthetic datasets…
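The abstract does not spell out the loss, so the following is only an illustrative sketch of the two ingredients it names, not the authors' formulation. It assumes that last-layer distillation is a mean squared error between the student's and teacher's final hidden states (with equal hidden sizes, so no projection is needed), and that the consistency term compares the student's final hidden states for an input and a perturbed copy of it. The function names `mse`, `last_layer_ild_loss`, and `cr_ild_loss` and the `weight` parameter are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]  # hidden states of one layer: [tokens][hidden_dim]


def mse(a: Matrix, b: Matrix) -> float:
    """Mean squared error between two equally shaped matrices."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("shape mismatch")
    total = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def last_layer_ild_loss(student_layers: Sequence[Matrix],
                        teacher_layers: Sequence[Matrix]) -> float:
    """Distill only the final layer; earlier layers are ignored.

    Assumption: student and teacher share the same hidden size.
    """
    return mse(student_layers[-1], teacher_layers[-1])


def cr_ild_loss(student_clean: Sequence[Matrix],
                student_perturbed: Sequence[Matrix],
                teacher_clean: Sequence[Matrix],
                weight: float = 1.0) -> float:
    """Last-layer distillation plus an assumed consistency penalty.

    The consistency term (an assumption, not taken from the paper) pulls
    the student's final hidden states for a perturbed input toward those
    for the clean input.
    """
    distill = last_layer_ild_loss(student_clean, teacher_clean)
    consistency = mse(student_clean[-1], student_perturbed[-1])
    return distill + weight * consistency


# Toy usage: two layers, one token, hidden size 2.
teacher = [[[0.0, 0.0]], [[1.0, 2.0]]]
student = [[[9.0, 9.0]], [[1.0, 0.0]]]
student_noisy = [[[9.0, 9.0]], [[1.0, 1.0]]]
loss = cr_ild_loss(student, student_noisy, teacher, weight=0.5)
```

In this toy setup the first student layer differs wildly from the teacher's, yet it contributes nothing to the loss, which is the point of restricting distillation to the last layer.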
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Attention Dropout · Linear Layer · Linear Warmup With Linear Decay · Layer Normalization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Byte Pair Encoding
