Massively Multilingual Word Embeddings

Waleed Ammar; George Mulcaire; Yulia Tsvetkov; Guillaume Lample; Chris Dyer; Noah A. Smith

arXiv:1602.01925 · cs.CL · May 24, 2016 · 282 citations

TL;DR

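This paper presents methods for estimating and evaluating word embeddings for more than fifty languages in a single shared space, using dictionaries and monolingual data rather than parallel data. Its evaluation method, multiQVEC-CCA, correlates better than previous methods with two downstream tasks (text categorization and parsing), and all methods are released as open source.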

Contribution

The paper introduces two estimation methods, multiCluster and multiCCA, and one evaluation method, multiQVEC-CCA. The estimation methods rely on dictionaries and monolingual data and do not require parallel corpora.

Findings

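01 multiQVEC-CCA correlates better than previous evaluation methods with two downstream tasks (text categorization and parsing)

02 The estimation methods embed words from more than fifty languages in a single shared space

03 Open-source releases of all methods, together with a web portal for evaluation, are intended to facilitate further research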

Abstract

We introduce new methods for estimating and evaluating embeddings of words in more than fifty languages in a single shared embedding space. Our estimation methods, multiCluster and multiCCA, use dictionaries and monolingual data; they do not require parallel data. Our new evaluation method, multiQVEC-CCA, is shown to correlate better than previous ones with two downstream tasks (text categorization and parsing). We also describe a web portal for evaluation that will facilitate further research in this area, along with open-source releases of all our methods.
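The abstract does not spell out how dictionaries and monolingual data combine into one shared space. The sketch below shows one plausible reading of a cluster-based approach, under stated assumptions: words linked by dictionary entries are merged into clusters (connected components of the translation graph), and monolingual text is then relabeled with cluster IDs so that any standard monolingual embedding trainer could run on the result. The function names `build_clusters` and `relabel`, the `(lang, word)` node format, and the toy dictionary are all hypothetical and illustrative. This is not the authors' released implementation.

```python
def build_clusters(translation_pairs):
    """Map each (lang, word) node to a cluster ID.

    Assumption: a cluster is a connected component of the graph whose
    edges are bilingual dictionary entries. Implemented with union-find.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge the two endpoints of every dictionary entry.
    for a, b in translation_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Assign compact integer IDs to the component roots.
    root_ids = {}
    cluster_of = {}
    for node in sorted(parent):
        root = find(node)
        cluster_of[node] = root_ids.setdefault(root, len(root_ids))
    return cluster_of


def relabel(tokens, lang, cluster_of):
    """Replace dictionary-covered tokens with cluster IDs.

    Tokens not covered by the dictionary keep a language-prefixed form,
    so they stay distinct across languages.
    """
    out = []
    for tok in tokens:
        key = (lang, tok)
        if key in cluster_of:
            out.append(f"C{cluster_of[key]}")
        else:
            out.append(f"{lang}:{tok}")
    return out


# Toy dictionary entries (illustrative only, not from the paper).
toy_pairs = [
    (("en", "dog"), ("fr", "chien")),
    (("fr", "chien"), ("de", "Hund")),
    (("en", "cat"), ("fr", "chat")),
]
cluster_of = build_clusters(toy_pairs)
french_sentence = relabel(["le", "chien"], "fr", cluster_of)
german_sentence = relabel(["der", "Hund"], "de", cluster_of)
```

In this toy run, "chien" and "Hund" receive the same cluster token because both connect to "dog", while the articles "le" and "der" remain language-specific. Training a monolingual embedding model on the relabeled corpora from all languages would then place translation-equivalent words at the same point in the space by construction.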

Code & Models

Repositories

idiap/mhan

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling