Revisiting Intermediate Layer Distillation for Compressing Language Models: An Overfitting Perspective

Jongwoo Ko; Seungjoon Park; Minchan Jeong; Sukjin Hong; Euijai Ahn; Du-Seong Chang; Se-Young Yun

arXiv:2302.01530·cs.CL·February 6, 2023

TL;DR

This paper finds that intermediate layer distillation (ILD) for language models tends to overfit the training data. It observes that distilling only the last Transformer layer and conducting ILD on supplementary tasks both mitigate this, and it proposes consistency-regularized ILD (CR-ILD) based on these two findings.

Contribution

The paper introduces consistency-regularized ILD (CR-ILD), a simple method that keeps the student model from overfitting the training dataset during knowledge distillation.

Findings

01

CR-ILD outperforms existing KD methods on the GLUE benchmark

02

Distilling only the last Transformer layer improves generalization

03

Using supplementary tasks reduces overfitting in ILD
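The second finding contrasts distilling every matched layer with distilling only the last one. The sketch below illustrates that contrast in plain Python. It is not taken from the paper or its repository: the mean-squared-error distance, the function names `mse` and `ild_loss`, and the assumption that student and teacher layers are already paired and share a hidden size are all illustrative choices.

```python
from typing import List, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length hidden vectors."""
    if len(a) != len(b):
        raise ValueError("hidden vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def ild_loss(
    student_layers: List[Sequence[float]],
    teacher_layers: List[Sequence[float]],
    last_layer_only: bool = True,
) -> float:
    """Hypothetical ILD loss over pre-aligned layer pairs.

    Assumption: student_layers[i] is matched to teacher_layers[i] and both
    already live in the same space (no projection is modelled here).
    """
    if len(student_layers) != len(teacher_layers) or not student_layers:
        raise ValueError("expected the same non-zero number of layers")
    if last_layer_only:
        # Match only the final Transformer layer's representation.
        return mse(student_layers[-1], teacher_layers[-1])
    # Otherwise average the distance over every matched layer pair.
    per_layer = [mse(s, t) for s, t in zip(student_layers, teacher_layers)]
    return sum(per_layer) / len(per_layer)


# Toy hidden states: two layers, hidden size two.
student = [[0.0, 0.0], [1.0, 2.0]]
teacher = [[1.0, 1.0], [1.0, 4.0]]
last_only = ild_loss(student, teacher, last_layer_only=True)
all_layers = ild_loss(student, teacher, last_layer_only=False)
```

With `last_layer_only=True` the student is constrained at a single point in the network, which is one way to read the finding that transferring less intermediate information can generalize better.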

Abstract

Knowledge distillation (KD) is a highly promising method for mitigating the computational problems of pre-trained language models (PLMs). Among various KD approaches, Intermediate Layer Distillation (ILD) has been a de facto standard KD method with its performance efficacy in the NLP field. In this paper, we find that existing ILD methods are prone to overfitting to training datasets, although these methods transfer more information than the original KD. Next, we present the simple observations to mitigate the overfitting of ILD: distilling only the last Transformer layer and conducting ILD on supplementary tasks. Based on our two findings, we propose a simple yet effective consistency-regularized ILD (CR-ILD), which prevents the student model from overfitting the training dataset. Substantial experiments on distilling BERT on the GLUE benchmark and several synthetic datasets…
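The abstract does not spell out the form of the consistency regularizer, so the following is only a generic sketch of what a consistency term can look like: a symmetric KL penalty between the student's predictions on two views of the same input, added to a task loss and an ILD loss. The function names, the symmetric-KL choice, and the weights `ild_weight` and `cr_weight` are assumptions for illustration and should not be read as the CR-ILD objective itself.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q); q is clamped at eps to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def consistency_penalty(logits_a: Sequence[float], logits_b: Sequence[float]) -> float:
    """Symmetric KL between predictions on two views of one input (assumed form)."""
    p, q = softmax(logits_a), softmax(logits_b)
    return 0.5 * (kl(p, q) + kl(q, p))


def total_loss(
    task_loss: float,
    ild: float,
    logits_a: Sequence[float],
    logits_b: Sequence[float],
    ild_weight: float = 1.0,  # hypothetical weighting, not from the paper
    cr_weight: float = 1.0,   # hypothetical weighting, not from the paper
) -> float:
    """Combine task loss, ILD loss, and the consistency term."""
    return task_loss + ild_weight * ild + cr_weight * consistency_penalty(logits_a, logits_b)


same = consistency_penalty([1.0, 2.0], [1.0, 2.0])
different = consistency_penalty([0.0, 0.0], [0.0, 1.0])
combined = total_loss(1.0, 2.0, [1.0, 2.0], [1.0, 2.0], ild_weight=0.5, cr_weight=3.0)
```

The penalty is zero when the two views produce identical predictions and grows as they diverge, which is the general mechanism by which a consistency term discourages a student from fitting idiosyncrasies of individual training examples.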

Code & Models

Repositories

jongwooko/cr-ild (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Attention Dropout · Linear Layer · Linear Warmup With Linear Decay · Layer Normalization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Byte Pair Encoding