Investigating Lexical Sharing in Multilingual Machine Translation for Indian Languages
Sonal Sannigrahi, Rachel Bawden

TL;DR
This paper examines lexical sharing in multilingual machine translation for Indian languages, analyzing the effects of data sampling, vocabulary size, and transliteration on translation performance and cross-script generalization.
Contribution
It provides an empirical analysis of how transliteration and vocabulary choices impact multilingual MT performance for Indian languages, including unseen languages.
Findings
Transliteration does not give pronounced improvements in translation quality.
Multilingual models trained on original scripts are robust to cross-script differences.
Translation performance involves trade-offs between the data sampling strategy and the vocabulary size.
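To make the data sampling finding concrete, here is a minimal sketch of temperature-based sampling, a common way to rebalance languages in multilingual MT. This summary does not say which sampling scheme the paper uses, so the technique, the function name `temperature_sampling_probs`, and the corpus sizes are all illustrative assumptions.

```python
def temperature_sampling_probs(sizes, temperature):
    """Return per-language sampling probabilities p_i proportional to (n_i / N) ** (1 / T).

    Assumption: temperature-based sampling is used here only as an
    illustration; it is not confirmed as the paper's sampling method.
    """
    total = sum(sizes.values())
    # Raise each language's data share to the power 1/T.
    scaled = {lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()}
    # Renormalise so the probabilities sum to 1.
    norm = sum(scaled.values())
    return {lang: s / norm for lang, s in scaled.items()}


# Hypothetical sentence-pair counts, NOT the paper's data.
corpus_sizes = {"hi": 1_000_000, "gu": 100_000, "ne": 50_000}

proportional = temperature_sampling_probs(corpus_sizes, temperature=1.0)
flattened = temperature_sampling_probs(corpus_sizes, temperature=5.0)
```

With `temperature=1.0` each language is sampled in proportion to its data; higher temperatures move the distribution towards uniform, so lower-resource languages are seen more often during training.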
Abstract
Multilingual language models have shown impressive cross-lingual transfer ability across a diverse set of languages and tasks. To improve the cross-lingual ability of these models, some strategies include transliteration and finer-grained segmentation into characters as opposed to subwords. In this work, we investigate lexical sharing in multilingual machine translation (MT) from Hindi, Gujarati and Nepali into English. We explore the trade-offs that exist in translation performance between data sampling and vocabulary size, and we explore whether transliteration is useful in encouraging cross-script generalisation. We also verify how the different settings generalise to unseen languages (Marathi and Bengali). We find that transliteration does not give pronounced improvements and our analysis suggests that our multilingual MT models trained on original scripts seem to already be robust to…
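As an illustration of the kind of transliteration the abstract refers to, the sketch below maps Gujarati or Bengali text onto Devanagari by shifting Unicode code points. It relies on the fact that Unicode lays out these script blocks in parallel (Devanagari at U+0900, Bengali at U+0980, Gujarati at U+0A80). The function name `to_devanagari` is hypothetical, and the paper's actual transliteration tooling is not described in this summary, so this should be read as one possible approach rather than the authors' method.

```python
DEVANAGARI_START = 0x0900
BLOCK_SIZE = 0x80  # each of these script blocks spans 128 code points

# Start of each source script's Unicode block.
SCRIPT_STARTS = {
    "bengali": 0x0980,
    "gujarati": 0x0A80,
}


def to_devanagari(text, script):
    """Shift characters of `script` into the Devanagari block.

    Assumption: a plain offset mapping is good enough for illustration.
    Some code points have no exact counterpart in the target block, so a
    real system would need script-specific exceptions.
    """
    start = SCRIPT_STARTS[script]
    out = []
    for ch in text:
        cp = ord(ch)
        if start <= cp < start + BLOCK_SIZE:
            # Same relative position, different block.
            out.append(chr(cp - start + DEVANAGARI_START))
        else:
            # Leave spaces, punctuation, and Latin text untouched.
            out.append(ch)
    return "".join(out)


# Gujarati letter KA (U+0A95) becomes Devanagari letter KA (U+0915).
gujarati_ka_as_devanagari = to_devanagari("\u0a95", "gujarati")
```

After such a mapping, cognate words in Gujarati, Bengali, Hindi, Nepali, and Marathi share the same characters, which is what would allow a subword vocabulary to share entries across scripts.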
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
