The Highs and Lows of Simple Lexical Domain Adaptation Approaches for Neural Machine Translation
Nikolay Bogoychev, Pinzhen Chen

TL;DR
This paper explores simple, computationally inexpensive lexical domain adaptation methods for neural machine translation. The methods help in low-resource, out-of-domain scenarios but are ineffective when the domain mismatch is too large or when sufficient training data is available.
Contribution
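It adopts and evaluates two well-known but little-tested lexical adaptation techniques for neural machine translation, lexical shortlisting restricted by IBM statistical alignments and similarity-based hypothesis re-ranking, and highlights their strengths and limitations in low-resource domain adaptation.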
Findings
Effective in low-resource out-of-domain scenarios
Ineffective with large domain mismatch or ample data
With sufficient data, the IBM alignment model loses its advantage over the implicitly learned neural alignment
Abstract
Machine translation systems are vulnerable to domain mismatch, especially in a low-resource scenario. Out-of-domain translations are often of poor quality and prone to hallucinations, due to exposure bias and the decoder acting as a language model. We adopt two approaches to alleviate this problem: lexical shortlisting restricted by IBM statistical alignments, and hypothesis re-ranking based on similarity. The methods are computationally cheap and widely known, but have not been extensively tested for domain adaptation. We demonstrate success on low-resource out-of-domain test sets; however, the methods are ineffective when there is sufficient data or when the domain mismatch is too great. This is due both to the IBM model losing its advantage over the implicitly learned neural alignment and to issues with subword segmentation of out-of-domain words.
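To make the first approach in the abstract more concrete, the following is a minimal sketch of lexical shortlisting in general, not the authors' implementation. It assumes a lexical translation table of the kind an IBM alignment model produces (source token to target token probabilities) and restricts the decoder's output vocabulary to the most likely translations of the words in the current source sentence. The table contents, the `top_k` cutoff, the always-allowed token set, and the function names `build_shortlist` and `mask_scores` are all hypothetical and chosen only for illustration.

```python
import math


def build_shortlist(source_tokens, lex_table, always_allowed, top_k=2):
    """Collect the target tokens the decoder may emit for one source sentence.

    lex_table maps a source token to {target token: probability}; here it is a
    hand-written toy stand-in for a table estimated by an IBM alignment model.
    """
    allowed = set(always_allowed)
    for src in source_tokens:
        candidates = lex_table.get(src, {})
        # Keep the top_k most probable translations of this source token.
        best = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        allowed.update(tgt for tgt, _ in best)
    return allowed


def mask_scores(scores, allowed):
    """Give every token outside the shortlist a score of -inf."""
    return {tok: (s if tok in allowed else -math.inf) for tok, s in scores.items()}


# Toy data (assumption): a two-entry lexical table and a few frequent tokens
# that are always permitted regardless of the source sentence.
LEX_TABLE = {
    "haus": {"house": 0.8, "home": 0.15, "building": 0.05},
    "klein": {"small": 0.7, "little": 0.3},
}
ALWAYS_ALLOWED = {"</s>", "the", "a"}

allowed = build_shortlist(["klein", "haus"], LEX_TABLE, ALWAYS_ALLOWED, top_k=2)

# Hypothetical decoder log-scores for one step; "banana" stands in for a
# fluent but unrelated token of the kind a hallucinating decoder might prefer.
decoder_scores = {"house": -1.2, "banana": -0.3, "small": -0.9, "</s>": -2.0}
masked = mask_scores(decoder_scores, allowed)
best_token = max(masked, key=masked.get)
print(sorted(allowed), best_token)
```

In this toy step the unrestricted decoder would pick the unrelated token, while the shortlisted decoder can only choose among tokens licensed by the source sentence. The same mechanism suggests why the abstract reports problems with subword segmentation: if an out-of-domain word is split into pieces the table has never aligned, the shortlist has little useful to offer for it.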
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
