A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space
Alex Jones, William Yang Wang, Kyle Mahowald

TL;DR
This study investigates the factors that influence cross-lingual sentence alignment in shared embedding spaces across 101 languages. Using the multilingual models LaBSE and LASER, it identifies key linguistic predictors of alignment and the effect of training data composition.
Contribution
It provides a comprehensive analysis of linguistic and training factors affecting cross-lingual alignment in shared embedding spaces for numerous languages.
Findings
Word order agreement strongly predicts cross-lingual alignment.
Agreement in morphological complexity influences alignment quality.
In-family training data is a stronger predictor of alignment than language-specific training data.
Abstract
In cross-lingual language models, representations for many different languages live in the same space. Here, we investigate the linguistic and non-linguistic factors affecting sentence-level alignment in cross-lingual pretrained language models for 101 languages and 5,050 language pairs. Using BERT-based LaBSE and BiLSTM-based LASER as our models, and the Bible as our corpus, we compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance, as well as four intrinsic measures of vector space alignment and isomorphism. We then examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics. The results of our analyses show that word order agreement and agreement in morphological complexity are two of the strongest linguistic predictors of cross-linguality. We also note in-family…
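The abstract's task-based measure, bitext retrieval, can be illustrated with a minimal sketch. It assumes plain cosine nearest-neighbor search and accuracy as the score; the paper's exact scoring function and metric are not given in the text above, and the toy vectors and the name `retrieval_accuracy` are hypothetical. The sketch also checks that 101 languages yield the 5,050 unordered language pairs mentioned in the abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieval_accuracy(src, tgt):
    """Fraction of source sentences whose most similar target sentence
    (by cosine) sits at the same index, i.e. is the true translation.
    Assumption: src[i] and tgt[i] embed the same verse in two languages."""
    hits = 0
    for i, s in enumerate(src):
        best = max(range(len(tgt)), key=lambda j: cosine(s, tgt[j]))
        hits += int(best == i)
    return hits / len(src)

# Number of unordered pairs among 101 languages: C(101, 2).
n_pairs = math.comb(101, 2)

# Toy sentence embeddings (hypothetical values, not model output).
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.3], [0.8, 0.0, 0.2]]

accuracy = retrieval_accuracy(src, tgt)
print(n_pairs, round(accuracy, 3))
```

In this toy setup the third source sentence retrieves the wrong target, so two of three retrievals succeed. With real models, `src` and `tgt` would hold the LaBSE or LASER embeddings of parallel Bible verses for one language pair.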
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
