Understanding and Improving Lexical Choice in Non-Autoregressive Translation
Liang Ding, Longyue Wang, Xuebo Liu, Derek F. Wong, Dacheng Tao, Zhaopeng Tu

TL;DR
This paper shows that knowledge distillation propagates lexical choice errors on low-frequency words from the autoregressive teacher to non-autoregressive translation (NAT) models. It proposes exposing the raw data to the NAT model during training to mitigate this, which improves translation quality.
Contribution
The paper exposes raw data to NAT models through an additional Kullback-Leibler (KL) divergence term that compares the model's lexical choices with those embedded in the raw data, reducing lexical errors on low-frequency words.
Findings
Improved BLEU scores on WMT14 English-German and WMT16 Romanian-English datasets.
Reduced lexical choice errors on low-frequency words.
Method is effective across different language pairs and model architectures.
Abstract
Knowledge distillation (KD) is essential for training non-autoregressive translation (NAT) models by reducing the complexity of the raw data with an autoregressive teacher model. In this study, we empirically show that as a side effect of this training, the lexical choice errors on low-frequency words are propagated to the NAT model from the teacher model. To alleviate this problem, we propose to expose the raw data to NAT models to restore the useful information of low-frequency words, which are missed in the distilled data. To this end, we introduce an extra Kullback-Leibler divergence term derived by comparing the lexical choice of NAT model and that embedded in the raw data. Experimental results across language pairs and model architectures demonstrate the effectiveness and universality of the proposed approach. Extensive analyses confirm our claim that our approach improves…
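The extra KL divergence term described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption for illustration: the helper names (`lexical_prior`, `kl_divergence`, `training_loss`), the use of word-aligned pairs to estimate the prior, the direction of the KL term (prior to model), and the `weight` parameter are hypothetical, not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def lexical_prior(aligned_pairs):
    """Estimate P(target_word | source_word) from (source, target) word pairs.

    ASSUMPTION: `aligned_pairs` are word-aligned pairs extracted from the
    raw (undistilled) parallel data; how they are obtained is not shown.
    """
    counts = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        counts[src][tgt] += 1
    return {
        src: {tgt: c / sum(tgts.values()) for tgt, c in tgts.items()}
        for src, tgts in counts.items()
    }


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over the support of p; eps guards against log(0)."""
    return sum(
        pv * math.log(pv / max(q.get(word, 0.0), eps))
        for word, pv in p.items()
        if pv > 0
    )


def training_loss(nll, prior_dist, model_dist, weight=0.5):
    """Add the weighted KL term to the usual loss on distilled data.

    ASSUMPTION: a simple weighted sum; `weight` is a hypothetical knob.
    """
    return nll + weight * kl_divergence(prior_dist, model_dist)


# Toy raw-data alignments: "haus" is translated two different ways.
pairs = [("haus", "house"), ("haus", "house"), ("haus", "home"), ("hund", "dog")]
prior = lexical_prior(pairs)

# A model that always picks "house" ignores the rarer choice "home",
# so the KL term is positive and penalizes it.
collapsed_model = {"house": 1.0}
penalty = kl_divergence(prior["haus"], collapsed_model)
```

In this toy setup the penalty is zero when the model's distribution matches the raw-data prior, and it grows as the model drops rarer translations, which is the kind of low-frequency information the abstract says is missed in distilled data.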
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
