Understanding and Improving Knowledge Distillation for Quantization-Aware Training of Large Transformer Encoders

Minsoo Kim; Sihwa Lee; Sukjin Hong; Du-Seong Chang; Jungwook Choi

arXiv:2211.11014 · cs.CL · November 22, 2022

TL;DR

This paper analyzes how knowledge distillation can be optimized for quantization-aware training of large Transformer models, proposing new methods that improve accuracy with ultra-low precision weights.

Contribution

It introduces attention-map and attention-output distillation losses and unifies them, enhancing QAT performance for large Transformers with sub-2-bit weights.

Findings

01 Proposed KD methods outperform previous approaches.

02 Achieved state-of-the-art accuracy for Transformers with sub-2-bit weight quantization.

03 Improved attention recovery in quantized models.

Abstract

Knowledge distillation (KD) has been a ubiquitous method for model compression to strengthen the capability of a lightweight model with the transferred knowledge from the teacher. In particular, KD has been employed in quantization-aware training (QAT) of Transformer encoders like BERT to improve the accuracy of the student model with the reduced-precision weight parameters. However, little is understood about which of the various KD approaches best fits the QAT of Transformers. In this work, we provide an in-depth analysis of the mechanism of KD on attention recovery of quantized large Transformers. In particular, we reveal that the previously adopted MSE loss on the attention score is insufficient for recovering the self-attention information. Therefore, we propose two KD methods: attention-map and attention-output losses. Furthermore, we explore the unification of both losses to…
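
As a rough illustration of the two losses named in the abstract, the sketch below assumes that the attention-map loss is a KL divergence between the teacher's and student's softmax-normalized attention maps, and that the attention-output loss is an MSE between their self-attention outputs. These definitions, the function names, and the weights `map_weight` and `output_weight` are assumptions made for illustration; the paper and its repository define the actual formulation. Plain Python lists stand in for tensors (rows are query positions).

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map_loss(teacher_scores, student_scores, eps=1e-12):
    """Assumed form: KL(teacher map || student map), averaged over query rows."""
    total = 0.0
    for t_row, s_row in zip(teacher_scores, student_scores):
        p = softmax(t_row)  # teacher attention map for this query
        q = softmax(s_row)  # student attention map for this query
        total += sum(
            pi * (math.log(pi + eps) - math.log(qi + eps))
            for pi, qi in zip(p, q)
        )
    return total / len(teacher_scores)


def attention_output_loss(teacher_out, student_out):
    """Assumed form: mean squared error between self-attention outputs."""
    sq_errors = [
        (t - s) ** 2
        for t_row, s_row in zip(teacher_out, student_out)
        for t, s in zip(t_row, s_row)
    ]
    return sum(sq_errors) / len(sq_errors)


def unified_loss(teacher_scores, student_scores, teacher_out, student_out,
                 map_weight=1.0, output_weight=1.0):
    """Hypothetical unification: a weighted sum of the two losses."""
    return (
        map_weight * attention_map_loss(teacher_scores, student_scores)
        + output_weight * attention_output_loss(teacher_out, student_out)
    )
```

When the student matches the teacher, both terms are zero; any mismatch in the attention distribution or in the attention output makes the corresponding term positive, which is the signal a quantized student would be trained to reduce.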

Code & Models

Repositories

marsjacobs/kd-qat-large-enc (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Neural Networks and Applications · Model Reduction and Neural Networks

Methods: Multi-Head Attention · Attention Is All You Need · Weight Decay · Adam · Linear Layer · Dense Connections · Residual Connection · Byte Pair Encoding · Attention Dropout