TL;DR
This paper introduces Evolving Knowledge Distillation, a progressive training method that enables lightweight neural machine translation models to approach the performance of larger models by learning from a sequence of increasingly capable teachers.
Contribution
The paper proposes Evolving Knowledge Distillation (EKD), a progressive training framework designed to reduce the effect of the teacher-student capacity gap in knowledge distillation for NMT models.
Findings
EKD improves translation quality on the IWSLT-14, WMT-17, and WMT-23 benchmarks.
On IWSLT-14, the final student model reaches 34.24 BLEU, within 0.08 BLEU of the strongest teacher (34.32 BLEU).
EKD yields consistent improvements at each stage, narrowing the performance gap between small and large models.
Abstract
Recent advancements in Neural Machine Translation (NMT) have significantly improved translation quality. However, the increasing size and complexity of state-of-the-art models present significant challenges for deployment on resource-limited devices. Knowledge distillation (KD) is a promising approach for compressing models, but its effectiveness diminishes when there is a large capacity gap between teacher and student models. To address this issue, we propose Evolving Knowledge Distillation (EKD), a progressive training framework in which the student model learns from a sequence of teachers with gradually increasing capacities. Experiments on IWSLT-14, WMT-17, and WMT-23 benchmarks show that EKD leads to consistent improvements at each stage. On IWSLT-14, the final student achieves a BLEU score of 34.24, narrowing the gap to the strongest teacher (34.32 BLEU) to just 0.08 BLEU. Similar…
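The progressive schedule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which distillation objective EKD uses, so the token-level loss below is the standard Hinton-style formulation, swapped in as an assumption. The names `kd_loss`, `teacher_schedule`, `evolving_distillation`, the `params` field, and the `train_stage` callback are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, gold_index, alpha=0.5, temperature=1.0):
    """Assumed token-level objective (not confirmed by the source):
    alpha * CE(student, gold) + (1 - alpha) * T^2 * KL(teacher || student)."""
    # Cross-entropy against the reference token, at temperature 1.
    ce = -math.log(softmax(student_logits)[gold_index])
    # KL divergence between temperature-softened teacher and student distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(p_teacher, p_student)
        if t > 0
    )
    return alpha * ce + (1 - alpha) * temperature ** 2 * kl


def teacher_schedule(teachers):
    """Order teachers from smallest to largest capacity.
    Using parameter count as the capacity measure is an assumption."""
    return sorted(teachers, key=lambda t: t["params"])


def evolving_distillation(student, teachers, train_stage):
    """Run one distillation stage per teacher, carrying the student forward.
    `train_stage(student, teacher)` is a hypothetical callback that returns
    the student after distilling from that teacher."""
    stages = []
    for teacher in teacher_schedule(teachers):
        student = train_stage(student, teacher)
        stages.append(teacher["name"])
    return student, stages
```

In a real setup, `train_stage` would wrap a full training loop that minimizes something like `kd_loss` over the parallel corpus; the point of the sketch is only that the same student is passed from one stage to the next while the teacher grows.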
