Improving Knowledge Distillation for BERT Models: Loss Functions, Mapping Methods, and Weight Tuning

Apoorv Dankar; Adeem Jassani; Kartikaeya Kumar

arXiv:2308.13958 · cs.CL · August 29, 2023 · 1 citation

TL;DR

This paper studies knowledge distillation for compressing BERT into a TinyBERT student model, experimenting with loss functions, transformer layer mapping methods, and the weights of the attention and representation losses, and evaluates the resulting techniques on selected GLUE tasks.

Contribution

It introduces new methods for loss functions, layer mapping, and weight tuning in knowledge distillation for BERT, advancing model compression techniques.

Findings

01 Improved accuracy on GLUE tasks

02 Enhanced distillation efficiency

03 Effective layer mapping strategies

Abstract

The use of large transformer-based models such as BERT, GPT, and T5 has led to significant advancements in natural language processing. However, these models are computationally expensive, necessitating model compression techniques that reduce their size and complexity while maintaining accuracy. This project investigates and applies knowledge distillation for BERT model compression, specifically focusing on the TinyBERT student model. We explore various techniques to improve knowledge distillation, including experimentation with loss functions, transformer layer mapping methods, and tuning the weights of the attention and representation losses. We evaluate our proposed techniques on a selection of downstream tasks from the GLUE benchmark. The goal of this work is to improve the efficiency and effectiveness of knowledge distillation, enabling the development of more efficient and accurate…
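
As a rough illustration of the two knobs the abstract mentions, layer mapping and the weighting of the attention and representation losses, the sketch below shows one way they could fit together. It is not the authors' implementation. The function names (`uniform_layer_map`, `mse`, `layer_distill_loss`) and the parameters `attn_weight` and `rep_weight` are hypothetical, each layer's attention map and hidden state is flattened to a plain list of floats, and the student and teacher are assumed to share dimensions (TinyBERT itself uses a learned projection when hidden sizes differ). The uniform mapping is one common choice; the abstract does not say which mappings the paper compares.

```python
def uniform_layer_map(num_student_layers, num_teacher_layers):
    """Pair student layer m (1-based) with teacher layer m * (N / M).

    Assumption: the teacher depth N is a multiple of the student depth M.
    """
    if num_teacher_layers % num_student_layers != 0:
        raise ValueError("teacher depth must be a multiple of student depth")
    step = num_teacher_layers // num_student_layers
    return [m * step for m in range(1, num_student_layers + 1)]


def mse(a, b):
    """Mean squared error between two equal-length lists of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def layer_distill_loss(student_attn, teacher_attn,
                       student_hidden, teacher_hidden,
                       layer_map, attn_weight=1.0, rep_weight=1.0):
    """Weighted sum of attention and representation MSE over mapped layers.

    student_* and teacher_* are lists indexed by layer (0-based storage);
    layer_map[m - 1] is the 1-based teacher layer paired with student layer m.
    attn_weight and rep_weight are hypothetical tuning knobs.
    """
    total = 0.0
    for m, t in enumerate(layer_map, start=1):
        attn_term = mse(student_attn[m - 1], teacher_attn[t - 1])
        rep_term = mse(student_hidden[m - 1], teacher_hidden[t - 1])
        total += attn_weight * attn_term + rep_weight * rep_term
    return total
```

With a 4-layer student and a 12-layer teacher, `uniform_layer_map(4, 12)` pairs student layers 1 to 4 with teacher layers 3, 6, 9, and 12. Changing `attn_weight` and `rep_weight` shifts how much each term contributes to the total.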

Taxonomy

Topics: Topic Modeling · Advanced Neural Network Applications · Natural Language Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Weight Decay · Linear Layer · WordPiece · Discriminative Fine-Tuning · Layer Normalization · Linear Warmup With Cosine Annealing · Gated Linear Unit