How Does Distilled Data Complexity Impact the Quality and Confidence of Non-Autoregressive Machine Translation?
Weijia Xu, Shuming Ma, Dongdong Zhang, Marine Carpuat

TL;DR
This paper investigates how the complexity of distilled training data affects the translation quality and confidence calibration of non-autoregressive (NAR) machine translation models. It finds that reducing lexical diversity and reducing reordering complexity both improve quality, while reduced lexical diversity is the main reason distillation increases model confidence.
Contribution
It provides a detailed analysis of how different complexity aspects of distilled data influence NAR translation quality and confidence calibration, highlighting lexical diversity as a crucial factor.
Findings
Reducing lexical diversity improves NAR translation quality.
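Decreasing reordering complexity also helps NAR models learn the alignment between source and target.
Reduced lexical diversity is the main reason distillation increases model confidence, and this affects the calibration of different NAR models differently.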
Abstract
While non-autoregressive (NAR) models are showing great promise for machine translation, their use is limited by their dependence on knowledge distillation from autoregressive models. To address this issue, we seek to understand why distillation is so effective. Prior work suggests that distilled training data is less complex than manual translations. Based on experiments with the Levenshtein Transformer and the Mask-Predict NAR models on the WMT14 German-English task, this paper shows that different types of complexity have different impacts: while reducing lexical diversity and decreasing reordering complexity both help NAR learn better alignment between source and target, and thus improve translation quality, lexical diversity is the main reason why distillation increases model confidence, which affects the calibration of different NAR models differently.
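The abstract does not say how calibration is measured. As a minimal sketch, assuming a standard expected calibration error (ECE) over per-token confidences, the idea of comparing model confidence against accuracy could look like the following. The function name, the equal-width binning, and the toy inputs are illustrative assumptions, not the authors' evaluation code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between average confidence and accuracy per bin.

    confidences: predicted probabilities in [0, 1], one per token (assumed input).
    correct: booleans, True where the predicted token matched the reference.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes in proportion to how many predictions it holds.
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Toy example: the model is 90% confident on four tokens but only three are right.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

Under this measure, a model whose confidence rises faster than its accuracy becomes overconfident and its ECE grows, while an underconfident model can become better calibrated from the same increase. That is one way the same confidence boost could affect different NAR models differently.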
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Knowledge Distillation · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout · Dense Connections · Adam · Label Smoothing
