TL;DR
This paper presents a method to combine static and contextual multilingual embeddings, enhancing cross-lingual representations without needing parallel text, and demonstrates improvements on semantic tasks.
Contribution
The paper combines static and contextual embeddings in three steps: it extracts static embeddings from XLM-R, aligns them across languages with VecMap, and then uses the aligned embeddings in a new continued pre-training method to better align XLM-R's own representation space.
Findings
High-quality static embeddings for 40 languages extracted from XLM-R.
Improved performance on complex semantic tasks using combined embeddings.
Continued pre-training without parallel text enhances multilingual alignment.
Abstract
Static and contextual multilingual embeddings have complementary strengths. Static embeddings, while less expressive than contextual language models, can be more straightforwardly aligned across multiple languages. We combine the strengths of static and contextual models to improve multilingual representations. We extract static embeddings for 40 languages from XLM-R, validate those embeddings with cross-lingual word retrieval, and then align them using VecMap. This results in high-quality, highly multilingual static embeddings. Then we apply a novel continued pre-training approach to XLM-R, leveraging the high-quality alignment of our static embeddings to better align the representation space of XLM-R. We show positive results for multiple complex semantic tasks. We release the static embeddings and the continued pre-training code. Unlike most previous work, our continued pre-training…
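The align-then-validate idea in the abstract can be illustrated with a small, self-contained sketch. It is not the paper's pipeline: it uses synthetic vectors instead of embeddings extracted from XLM-R, and it swaps in a plain orthogonal Procrustes mapping learned from a seed dictionary, whereas VecMap does considerably more than this. The function names, the toy data, and the seed/held-out split are all assumptions made for illustration. Retrieval is scored as precision@1 using cosine nearest neighbours.

```python
import numpy as np

def normalize_rows(x):
    """Scale each row to unit L2 norm so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def procrustes_map(src, tgt):
    """Return the orthogonal matrix W that minimizes ||src @ W - tgt||_F."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

def retrieval_precision_at_1(src, tgt):
    """Fraction of source rows whose cosine-nearest target row has the same index."""
    sims = normalize_rows(src) @ normalize_rows(tgt).T
    predicted = sims.argmax(axis=1)
    return float((predicted == np.arange(len(src))).mean())

# Toy data (assumption): the "source language" space is a rotated, slightly
# noisy copy of the "target language" space, and row i in one space
# translates to row i in the other.
rng = np.random.default_rng(0)
dim, n_words = 16, 200
tgt_vecs = rng.normal(size=(n_words, dim))
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
src_vecs = tgt_vecs @ rotation.T + 0.01 * rng.normal(size=(n_words, dim))

# Learn the mapping on a seed dictionary, then evaluate on held-out pairs.
seed, held_out = slice(0, 100), slice(100, 200)
w = procrustes_map(src_vecs[seed], tgt_vecs[seed])

before = retrieval_precision_at_1(src_vecs[held_out], tgt_vecs[held_out])
after = retrieval_precision_at_1(src_vecs[held_out] @ w, tgt_vecs[held_out])
print(f"precision@1 before alignment: {before:.2f}, after: {after:.2f}")
```

Restricting the mapping to an orthogonal matrix preserves distances and angles within the source space, so alignment changes only how the two spaces relate to each other, not the monolingual neighbourhood structure of either.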
Taxonomy
Methods: XLM-R
