TL;DR
This paper investigates various sampling and filtering techniques for neural machine translation distillation data, demonstrating that strategic upsampling and data combination improve translation quality.
Contribution
It systematically explores sampling, pruning, and deduplication methods, showing their impact on MT distillation performance with empirical results.
Findings
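Upsampling better hypotheses and combining them with the original data improves translation quality.
Careful upsampling outperforms a direct combination of the original and synthesized data.
The resampled mix also outperforms training on only the original data or only the synthesized data.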
Abstract
In most neural machine translation distillation or stealing scenarios, the goal is to preserve the performance of the target model (teacher). The highest-scoring hypothesis of the teacher model is commonly used to train a new model (student). If reference translations are also available, then better hypotheses (with respect to the references) can be upsampled and poor hypotheses either removed or undersampled. This paper explores the landscape of importance sampling methods (pruning, hypothesis upsampling and undersampling, deduplication, and their combinations) with English-to-Czech and English-to-German MT models, using standard MT evaluation metrics. We show that careful upsampling, combined with the original data, leads to better performance than training only on the original data, only on the synthesized data, or on their direct combination.
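The resampling idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the token-overlap F1 metric is a stand-in for the standard MT metrics the abstract mentions, and the names `overlap_f1`, `resample`, `build_training_set`, the `copies` schedule, and the `min_score` threshold are all hypothetical choices made for illustration.

```python
from collections import Counter


def overlap_f1(hypothesis: str, reference: str) -> float:
    """Token-level F1 against the reference (assumed stand-in for an MT metric)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    common = sum((Counter(hyp) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(hyp), common / len(ref)
    return 2 * precision * recall / (precision + recall)


def resample(source, hypotheses, reference, copies=(3, 1), min_score=0.0, dedup=True):
    """Turn the teacher hypotheses for one source sentence into training pairs.

    copies[i] is how often the i-th best hypothesis is repeated (upsampling);
    hypotheses ranked beyond len(copies) or scoring below min_score are pruned.
    The schedule and threshold are illustrative assumptions.
    """
    if dedup:
        # Drop exact duplicates while keeping the first occurrence of each.
        hypotheses = list(dict.fromkeys(hypotheses))
    ranked = sorted(hypotheses, key=lambda h: overlap_f1(h, reference), reverse=True)
    pairs = []
    for hyp, n in zip(ranked, copies):
        if overlap_f1(hyp, reference) < min_score:
            continue
        pairs.extend([(source, hyp)] * n)
    return pairs


def build_training_set(examples, include_original=True, **kwargs):
    """Combine resampled teacher output with the original (source, reference) pairs."""
    data = []
    for source, hypotheses, reference in examples:
        data.extend(resample(source, hypotheses, reference, **kwargs))
        if include_original:
            data.append((source, reference))
    return data


if __name__ == "__main__":
    examples = [(
        "the cat sat",
        ["die Katze saß", "die Katze saß", "Katze", "der Hund lief"],
        "die Katze saß",
    )]
    for pair in build_training_set(examples, copies=(3, 1), min_score=0.6):
        print(pair)
```

Setting `include_original=False` gives a set built only from synthesized data, and `copies=(1,)` reduces the scheme to keeping just the best-scoring hypothesis per source sentence, which makes it easy to compare the configurations the abstract contrasts.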
