TL;DR
This paper investigates whether free-order case-marking languages are harder for neural models to translate. It finds that word order flexibility has limited impact on translation quality, but that fixed-order languages still translate better when training data is limited.
Contribution
The study introduces a translation challenge set and synthetic languages to analyze the impact of word order and case marking on NMT performance across different resource levels.
Findings
Word order flexibility causes minimal NMT quality loss.
Case marking improves disambiguation in free-order languages.
Fixed-order languages outperform free-order ones in low-resource settings.
Abstract
Identifying factors that make certain languages harder to model than others is essential to reach language equality in future Natural Language Processing technologies. Free-order case-marking languages, such as Russian, Latin or Tamil, have proved more challenging than fixed-order languages for the tasks of syntactic parsing and subject-verb agreement prediction. In this work, we investigate whether this class of languages is also more difficult to translate by state-of-the-art Neural Machine Translation (NMT) models. Using a variety of synthetic languages and a newly introduced translation challenge set, we find that word order flexibility in the source language only leads to a very small loss of NMT quality, even though the core verb arguments become impossible to disambiguate in sentences without semantic cues. The latter issue is indeed solved by the addition of case marking.…
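The ambiguity described in the abstract can be illustrated with a toy sketch. It is not the paper's method for building synthetic languages: the `-nom` and `-acc` suffixes, the function names, and the example words are all hypothetical, chosen only to show why free word order hides who did what to whom unless arguments carry case marks.

```python
from itertools import permutations

# Hypothetical case suffixes; purely illustrative, not taken from the paper.
CASE_SUFFIX = {"subj": "-nom", "obj": "-acc"}


def variants(subj, verb, obj, case_marking=False):
    """Return every ordering of a subject-verb-object clause as strings."""
    if case_marking:
        subj += CASE_SUFFIX["subj"]
        obj += CASE_SUFFIX["obj"]
    return [" ".join(order) for order in permutations((subj, verb, obj))]


def is_ambiguous(subj, verb, obj, case_marking=False):
    """True if swapping subject and object can produce the same surface string."""
    original = set(variants(subj, verb, obj, case_marking))
    swapped = set(variants(obj, verb, subj, case_marking))
    return bool(original & swapped)


if __name__ == "__main__":
    # Free order, no case marking: "dog chases cat" could mean either reading.
    print(is_ambiguous("dog", "chases", "cat"))
    # Free order with case marking: the suffixes keep the two readings apart.
    print(is_ambiguous("dog", "chases", "cat", case_marking=True))
```

In this toy setting, a sentence such as "dog chases cat" has no semantic cue favouring one reading, so only the case suffixes distinguish the two meanings once word order is free.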
