TL;DR
This paper studies hallucinations in Neural Machine Translation (NMT), linking them to source perturbations and corpus-level noise, and shows how they are amplified by data-generation processes such as Backtranslation and sequence-level Knowledge Distillation.
Contribution
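It connects hallucinations under source perturbation to the Long-Tail theory of Feldman (2020) through an empirically validated hypothesis, shows that specific corpus-level noise patterns can produce natural hallucinations, and analyzes hallucination amplification in data-generation processes.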
Findings
Hallucinations under source perturbation are explained through the Long-Tail theory of Feldman (2020).
Specific corpus-level noise patterns, without any source perturbation, can generate two prominent types of natural hallucinations: detached and oscillatory outputs.
Hallucinations are amplified in Backtranslation and sequence-level Knowledge Distillation.
Abstract
In this work, we study hallucinations in Neural Machine Translation (NMT), which lie at an extreme end on the spectrum of NMT pathologies. Firstly, we connect the phenomenon of hallucinations under source perturbation to the Long-Tail theory of Feldman (2020), and present an empirically validated hypothesis that explains hallucinations under source perturbation. Secondly, we consider hallucinations under corpus-level noise (without any source perturbation) and demonstrate that two prominent types of natural hallucinations (detached and oscillatory outputs) could be generated and explained through specific corpus-level noise patterns. Finally, we elucidate the phenomenon of hallucination amplification in popular data-generation processes such as Backtranslation and sequence-level Knowledge Distillation.
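To make the notion of an oscillatory output concrete, the following is a minimal sketch of a repetition-based check. It is an illustrative heuristic, not the paper's detection procedure: the function names `max_ngram_count` and `looks_oscillatory`, the whitespace tokenization, and the `n` and `margin` parameters are all assumptions introduced here.

```python
from collections import Counter


def max_ngram_count(tokens, n=2):
    """Return the count of the most frequent n-gram in tokens (0 if there is none)."""
    ngrams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(ngrams)
    return max(counts.values(), default=0)


def looks_oscillatory(source, output, n=2, margin=2):
    """Flag an output whose most repeated n-gram occurs far more often than in the source.

    Hypothetical heuristic: whitespace tokenization and the `margin`
    threshold are assumptions, not values taken from the paper.
    """
    src_tokens = source.split()
    out_tokens = output.split()
    return max_ngram_count(out_tokens, n) - max_ngram_count(src_tokens, n) > margin


if __name__ == "__main__":
    src = "the cat sat on the mat"
    print(looks_oscillatory(src, "the cat the cat the cat the cat"))  # True
    print(looks_oscillatory(src, "die Katze sass auf der Matte"))     # False
```

Comparing against the source keeps the check from flagging translations of inputs that are themselves repetitive. Detached outputs (fluent but unrelated to the source) are not covered by a repetition check like this one and would need a separate adequacy signal.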
Taxonomy
Methods: Knowledge Distillation
