Massively Multilingual Word Embeddings
Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, Noah A. Smith

TL;DR
This paper presents new methods for estimating and evaluating word embeddings for more than fifty languages in a single shared space. The estimation methods need no parallel data, the new evaluation method correlates better with downstream tasks than previous ones, and all methods are released as open source.
Contribution
It introduces two estimation methods, multiCluster and multiCCA, which use dictionaries and monolingual data rather than parallel corpora, and one evaluation method, multiQVEC-CCA.
Findings
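multiQVEC-CCA correlates better than previous evaluation methods with two downstream tasks (text categorization and parsing)
The estimation methods embed words from more than fifty languages in a single shared space
A web portal for evaluation and open-source releases of all methods facilitate further research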
Abstract
We introduce new methods for estimating and evaluating embeddings of words in more than fifty languages in a single shared embedding space. Our estimation methods, multiCluster and multiCCA, use dictionaries and monolingual data; they do not require parallel data. Our new evaluation method, multiQVEC-CCA, is shown to correlate better than previous ones with two downstream tasks (text categorization and parsing). We also describe a web portal for evaluation that will facilitate further research in this area, along with open-source releases of all our methods.
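
The abstract says only that the estimation methods use dictionaries and monolingual data. As a rough illustration of how dictionary entries alone can tie words from many languages together, the sketch below groups (language, word) pairs into clusters of mutual translations by taking connected components over dictionary links. This is an assumption about the general idea behind a dictionary-based method such as multiCluster, not the authors' implementation; the function name `cluster_by_dictionary` and the toy entries are hypothetical.

```python
from collections import defaultdict


def cluster_by_dictionary(entries):
    """Group (language, word) pairs into connected components.

    `entries` is an iterable of pairs of (language, word) tuples, where
    each pair is one bilingual dictionary entry (a translation link).
    This is an illustrative sketch, not the paper's code.
    """
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in entries:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return list(clusters.values())


# Hypothetical toy dictionary entries spanning three languages.
toy_entries = [
    (("en", "dog"), ("fr", "chien")),
    (("fr", "chien"), ("de", "Hund")),
    (("en", "cat"), ("fr", "chat")),
]
clusters = cluster_by_dictionary(toy_entries)
```

Under this sketch, every word in a cluster could be replaced by a shared cluster ID in its monolingual corpus, so that a standard monolingual embedding trainer would place translations in one space. That last step is likewise an assumption and is not shown here.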
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling
