Understanding and Improving Knowledge Distillation for Quantization-Aware Training of Large Transformer Encoders
Minsoo Kim, Sihwa Lee, Sukjin Hong, Du-Seong Chang, Jungwook Choi

TL;DR
This paper analyzes how knowledge distillation can be optimized for quantization-aware training of large Transformer models, proposing new methods that improve accuracy with ultra-low precision weights.
Contribution
It introduces attention-map and attention-output distillation losses and unifies them, enhancing QAT performance for large Transformers with sub-2-bit weights.
Findings
The proposed attention-map and attention-output KD losses outperform the previously adopted MSE loss on attention scores.
They achieve state-of-the-art accuracy for Transformers quantized to sub-2-bit weights.
They improve attention recovery in quantized models.
Abstract
Knowledge distillation (KD) has been a ubiquitous method for model compression to strengthen the capability of a lightweight model with the transferred knowledge from the teacher. In particular, KD has been employed in quantization-aware training (QAT) of Transformer encoders like BERT to improve the accuracy of the student model with the reduced-precision weight parameters. However, little is understood about which of the various KD approaches best fits the QAT of Transformers. In this work, we provide an in-depth analysis of the mechanism of KD on attention recovery of quantized large Transformers. In particular, we reveal that the previously adopted MSE loss on the attention score is insufficient for recovering the self-attention information. Therefore, we propose two KD methods: attention-map and attention-output losses. Furthermore, we explore the unification of both losses to…
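To make the two proposed losses concrete, here is a minimal pure-Python sketch for a single attention head. It rests on assumptions that the excerpt above does not confirm: the attention-map loss is taken as a KL divergence between the teacher's and student's softmax-normalized attention maps, and the attention-output loss as an MSE between the attention-weighted value vectors. All function names are hypothetical, and this should not be read as the authors' implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(scores):
    """Row-wise softmax: turns raw attention scores into an attention map."""
    return [softmax(row) for row in scores]

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def mse(a, b):
    """Mean squared error between two equally shaped matrices."""
    count = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / count

def attention_map_loss(teacher_scores, student_scores, eps=1e-12):
    """ASSUMED form: KL(teacher || student) on attention maps, averaged over query rows."""
    t_map = attention_map(teacher_scores)
    s_map = attention_map(student_scores)
    total = 0.0
    for t_row, s_row in zip(t_map, s_map):
        total += sum(t * (math.log(t + eps) - math.log(s + eps))
                     for t, s in zip(t_row, s_row))
    return total / len(t_map)

def attention_output_loss(teacher_scores, teacher_values, student_scores, student_values):
    """ASSUMED form: MSE between attention outputs (attention map times values)."""
    t_out = matmul(attention_map(teacher_scores), teacher_values)
    s_out = matmul(attention_map(student_scores), student_values)
    return mse(t_out, s_out)

# Toy inputs: the student scores are the teacher scores shifted by a constant.
teacher_scores = [[1.0, 2.0], [3.0, 4.0]]
student_scores = [[6.0, 7.0], [8.0, 9.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
```

On the toy inputs, the score-level MSE is 25 while the attention-map loss is 0, because softmax is invariant to adding a constant to every score in a row. This only illustrates that a loss on raw scores and a loss on the normalized map measure different things. It is not a claim about the paper's own analysis of why the MSE loss falls short.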
Taxonomy
Topics: Advanced Neural Network Applications · Neural Networks and Applications · Model Reduction and Neural Networks
Methods: Multi-Head Attention · Attention Is All You Need · Weight Decay · Adam · Linear Layer · Dense Connections · Residual Connection · Byte Pair Encoding · Attention Dropout
