Improving Knowledge Distillation for BERT Models: Loss Functions, Mapping Methods, and Weight Tuning
Apoorv Dankar, Adeem Jassani, Kartikaeya Kumar

TL;DR
This paper enhances BERT model compression through improved knowledge distillation techniques, including novel loss functions, mapping methods, and weight tuning, tested on GLUE tasks to boost efficiency and accuracy.
Contribution
It introduces new methods for loss functions, layer mapping, and weight tuning in knowledge distillation for BERT, advancing model compression techniques.
Findings
Improved accuracy on GLUE tasks
Enhanced distillation efficiency
Effective layer mapping strategies
Abstract
The use of large transformer-based models such as BERT, GPT, and T5 has led to significant advancements in natural language processing. However, these models are computationally expensive, necessitating model compression techniques that reduce their size and complexity while maintaining accuracy. This project investigates and applies knowledge distillation for BERT model compression, focusing specifically on the TinyBERT student model. We explore several techniques to improve knowledge distillation, including experimenting with loss functions, transformer layer mapping methods, and tuning the weights of the attention and representation losses, and we evaluate the proposed techniques on a selection of downstream tasks from the GLUE benchmark. The goal of this work is to improve the efficiency and effectiveness of knowledge distillation, enabling the development of more efficient and accurate…
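To make the three ingredients named in the abstract concrete (a layer mapping, an attention loss, and a representation loss combined with tunable weights), here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the uniform mapping follows the common TinyBERT convention of pairing student layer m with teacher layer m * (N / M), mean squared error stands in for whichever loss functions the paper actually compares, student and teacher hidden sizes are assumed equal (so no projection matrix is needed), and the names `uniform_layer_map`, `layer_distillation_loss`, `alpha`, and `beta` are hypothetical.

```python
def uniform_layer_map(num_student_layers, num_teacher_layers):
    """Return the 1-indexed teacher layer paired with each student layer.

    Assumes a uniform mapping: student layer m -> teacher layer m * (N // M).
    """
    if num_teacher_layers % num_student_layers != 0:
        raise ValueError("teacher depth must be a multiple of student depth")
    step = num_teacher_layers // num_student_layers
    return [m * step for m in range(1, num_student_layers + 1)]


def mse(a, b):
    """Mean squared error between two equally shaped 2-D lists."""
    flat_a = [x for row in a for x in row]
    flat_b = [x for row in b for x in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("inputs must have the same number of elements")
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) / len(flat_a)


def layer_distillation_loss(student_attn, teacher_attn,
                            student_hidden, teacher_hidden,
                            alpha=1.0, beta=1.0):
    """Sum over mapped layers of alpha * attention loss + beta * hidden loss.

    Each argument is a list with one 2-D matrix per layer. alpha and beta
    are hypothetical names for the attention and representation weights.
    """
    layer_map = uniform_layer_map(len(student_attn), len(teacher_attn))
    total = 0.0
    for m, t in enumerate(layer_map):
        # layer_map is 1-indexed, Python lists are 0-indexed.
        attn_loss = mse(student_attn[m], teacher_attn[t - 1])
        hidden_loss = mse(student_hidden[m], teacher_hidden[t - 1])
        total += alpha * attn_loss + beta * hidden_loss
    return total


# A 4-layer student distilled from a 12-layer teacher.
print(uniform_layer_map(4, 12))
```

In this sketch, changing `alpha` and `beta` shifts how strongly the student is pushed to match the teacher's attention maps versus its hidden representations, and replacing `uniform_layer_map` with another function is one way to try a different layer mapping.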
Taxonomy
Topics: Topic Modeling · Advanced Neural Network Applications · Natural Language Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Weight Decay · Linear Layer · WordPiece · Discriminative Fine-Tuning · Layer Normalization · Linear Warmup With Cosine Annealing · Gated Linear Unit
