Don't Throw Away Data: Better Sequence Knowledge Distillation
Jun Wang, Eleftheria Briakou, Hamid Dadkhahi, Rishabh Agarwal, Colin Cherry, Trevor Cohn

TL;DR
This paper improves sequence knowledge distillation by training the student on several high-scoring MBR translations instead of a single selected sequence, capturing more of the teacher's output diversity and improving English-to-German and English-to-Japanese translation quality.
Contribution
The paper introduces MBR-n, a distillation method that uses multiple MBR translations rather than one. It improves over single-sequence approaches and comes with an analysis of data efficiency and the capacity curse.
Findings
Consistent improvements in translation quality over strong baselines for English-to-German and English-to-Japanese, across varying model sizes.
Enhanced data efficiency, and a clearer understanding of the capacity curse.
Potential for further gains with refined MBR integration.
Abstract
A critical component in knowledge distillation is the means of coupling the teacher and student. The predominant sequence knowledge distillation method involves supervised learning of the student against teacher-decoded outputs, and is exemplified by the current state of the art, which incorporates minimum Bayes risk (MBR) decoding. In this paper we seek to integrate MBR more tightly in distillation training, specifically by using several high scoring MBR translations, rather than a single selected sequence, thus capturing a rich diversity of teacher outputs. Our experiments on English to German and English to Japanese translation show consistent improvements over strong baseline methods for both tasks and with varying model sizes. Additionally, we conduct a detailed analysis focusing on data efficiency and capacity curse aspects to elucidate MBR-n and explore its further potential.
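The selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the teacher's candidate translations are already sampled, and it uses a toy unigram F1 as the utility function in place of a learned translation metric. The names unigram_f1, mbr_top_n, and build_distillation_pairs are hypothetical.

```python
from collections import Counter


def unigram_f1(hyp: str, ref: str) -> float:
    """Toy utility: unigram F1 between two whitespace-tokenized strings.
    Assumption: stands in for whatever metric a real MBR setup would use."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_top_n(candidates, n, utility=unigram_f1):
    """Rank candidates by expected utility against the other candidates
    (used as pseudo-references) and return the n highest-scoring ones."""
    scored = []
    for i, hyp in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        if others:
            expected = sum(utility(hyp, ref) for ref in others) / len(others)
        else:
            expected = 0.0  # a lone candidate has nothing to be compared against
        scored.append((expected, hyp))
    # Stable sort: ties keep their original candidate order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [hyp for _, hyp in scored[:n]]


def build_distillation_pairs(source, candidates, n):
    """Pair one source sentence with each of its top-n MBR translations.
    n = 1 corresponds to distilling from a single selected sequence."""
    return [(source, target) for target in mbr_top_n(candidates, n)]


if __name__ == "__main__":
    # Hypothetical teacher samples for one source sentence.
    samples = [
        "the cat sat on the mat",
        "the cat sat on a mat",
        "a cat is on the mat",
        "dogs bark loudly",
    ]
    for pair in build_distillation_pairs("die Katze sass auf der Matte", samples, n=2):
        print(pair)
```

With n = 1 the student sees only the consensus candidate for each source. Larger n keeps additional high-scoring candidates as extra supervised targets, which is the sense in which MBR-n avoids throwing data away.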
Taxonomy
Topics: Data Mining Algorithms and Applications · Semantic Web and Ontologies · Algorithms and Data Compression
Methods: Knowledge Distillation
