Discovering Representation Sprachbund For Multilingual Pre-Training
Yimin Fan, Yaobo Liang, Alexandre Muzio, Hany Hassan, Houqiang Li, Ming Zhou, Nan Duan

TL;DR
This paper introduces a multilingual pre-training approach that clusters languages into representation sprachbunds according to the similarity of their language representations, then trains a separate model for each group to improve cross-lingual NLP performance.
Contribution
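The paper proposes a new multilingual pre-training pipeline: it generates language representations from multilingual pre-trained models, uses linguistic analysis to show that representation similarity reflects linguistic similarity, and clusters languages into representation sprachbunds to improve performance across diverse languages.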
Findings
Models trained on representation sprachbunds outperform baseline models.
Linguistic similarity correlates with representation similarity in multilingual models.
Significant improvements on cross-lingual benchmarks were observed.
Abstract
Multilingual pre-trained models have demonstrated their effectiveness in many multilingual NLP tasks and enabled zero-shot or few-shot transfer from high-resource languages to low-resource ones. However, due to significant typological differences and contradictions between some languages, such models usually perform poorly on many languages and cross-lingual settings, which shows the difficulty of learning a single model to handle massive diverse languages well at the same time. To alleviate this issue, we present a new multilingual pre-training pipeline. We propose to generate language representation from multilingual pre-trained models and conduct linguistic analysis to show that language representation similarity reflects linguistic similarity from multiple perspectives, including language family, geographical sprachbund, lexicostatistics and syntax. Then we cluster all the target…
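The idea of deriving one vector per language and then grouping languages by representation similarity can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the mean pooling of sentence embeddings, the cosine similarity measure, the average-linkage agglomerative clustering, the helper names (`mean_vector`, `cosine`, `cluster_languages`), and the toy two-dimensional embeddings. In practice the sentence embeddings would come from a multilingual pre-trained model.

```python
import math


def mean_vector(vectors):
    """Average equal-length vectors into a single language representation."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def cluster_languages(lang_vectors, num_clusters):
    """Average-linkage agglomerative clustering on cosine similarity.

    This is an illustrative choice of clustering method, not necessarily
    the one used by the authors.
    """
    clusters = [[lang] for lang in sorted(lang_vectors)]

    def linkage(c1, c2):
        sims = [cosine(lang_vectors[a], lang_vectors[b]) for a in c1 for b in c2]
        return sum(sims) / len(sims)

    while len(clusters) > num_clusters:
        # Find the pair of clusters with the highest average similarity.
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = max(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters


# Hypothetical toy sentence embeddings (2-D) for four languages.
toy_sentence_embeddings = {
    "es": [[0.9, 0.1], [1.0, 0.2]],
    "pt": [[0.8, 0.2], [0.9, 0.1]],
    "ja": [[0.1, 0.9], [0.2, 1.0]],
    "ko": [[0.2, 0.8], [0.1, 0.9]],
}

lang_vectors = {
    lang: mean_vector(vecs) for lang, vecs in toy_sentence_embeddings.items()
}
sprachbunds = cluster_languages(lang_vectors, num_clusters=2)
print(sprachbunds)
```

With this toy data the two groups separate cleanly; each resulting group would then be used to pre-train its own model, as the TL;DR describes.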
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
