Learning Word Vectors for 157 Languages
Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, Tomas Mikolov

TL;DR
This paper presents the training of high-quality word vectors for 157 languages using Wikipedia and Common Crawl data, introducing new evaluation datasets and demonstrating strong performance across multiple languages.
Contribution
The paper trains and evaluates word embeddings for 157 languages at large scale, and introduces new word analogy datasets for French, Hindi, and Polish.
Findings
Strong performance compared to previous models on the 10 languages with existing evaluation datasets
Introduction of new analogy datasets for three languages
Coverage of 157 languages with high-quality embeddings
Abstract
Distributed word representations, or word vectors, have recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. A key ingredient to the successful application of these representations is to train them on very large corpora, and use these pre-trained models in downstream tasks. In this paper, we describe how we trained such high-quality word representations for 157 languages. We used two sources of data to train these models: the free online encyclopedia Wikipedia and data from the Common Crawl project. We also introduce three new word analogy datasets to evaluate these word vectors, for French, Hindi and Polish. Finally, we evaluate our pre-trained word vectors on 10 languages for which evaluation datasets exist, showing very strong performance compared to previous models.
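To make the word analogy evaluation mentioned in the abstract concrete, here is a minimal sketch of the common vector-offset formulation: for a question "a is to b as c is to ?", pick the word whose vector is closest (by cosine similarity) to b - a + c, excluding the three query words. This is an illustration under stated assumptions, not necessarily the exact protocol used in the paper; the function names and the 3-dimensional toy vectors are hypothetical and chosen only so the example runs on its own.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset method."""
    # Target point: b - a + c, computed component by component.
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidate answers.
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(solve_analogy(vectors, a, b, c) == d for a, b, c, d in questions)
    return correct / len(questions)


# Hypothetical toy vectors, NOT real fastText embeddings.
toy_vectors = {
    "paris": [1.0, 0.0, 1.0],
    "france": [1.0, 0.0, 0.0],
    "warsaw": [0.0, 1.0, 1.0],
    "poland": [0.0, 1.0, 0.0],
    "delhi": [0.5, 0.5, 1.0],  # distractor word
}
questions = [
    ("paris", "france", "warsaw", "poland"),
    ("france", "paris", "poland", "warsaw"),
]

if __name__ == "__main__":
    print(solve_analogy(toy_vectors, "paris", "france", "warsaw"))
    print(analogy_accuracy(toy_vectors, questions))
```

With real embeddings the candidate set is the whole vocabulary, so in practice the search is usually done with a matrix product over normalized vectors rather than a Python loop.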
Code & Models
- 🤗 facebook/fasttext-language-identification · model · 305k dl · ♡ 258
- 🤗 facebook/fasttext-en-vectors · model · 451 dl · ♡ 18
- 🤗 facebook/fasttext-ko-vectors · model · 19 dl · ♡ 10
- 🤗 facebook/fasttext-af-vectors · model · 2 dl
- 🤗 facebook/fasttext-sq-vectors · model · 9 dl · ♡ 1
- 🤗 facebook/fasttext-als-vectors · model · 2 dl
- 🤗 facebook/fasttext-am-vectors · model · 2 dl
- 🤗 facebook/fasttext-ar-vectors · model · 9 dl · ♡ 6
- 🤗 facebook/fasttext-an-vectors · model · 3 dl
- 🤗 facebook/fasttext-hy-vectors · model · 2 dl · ♡ 1
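As a rough illustration of working with pre-trained vectors like those listed above, the sketch below reads word vectors from the plain-text `.vec` layout used by fastText: a header line with the vocabulary size and dimension, then one line per word followed by its components. Whether a given repository above ships a file in this layout is an assumption to check against that repository; the function name `read_vec` and the sample data are hypothetical.

```python
import io


def read_vec(stream, limit=None):
    """Parse fastText-style text vectors from a file-like object.

    Returns (dim, vectors), where vectors maps each word to a list of floats.
    `limit` optionally caps the number of words read.
    """
    header = stream.readline().split()
    dim = int(header[1])  # header is "<vocab_size> <dim>"
    vectors = {}
    for i, line in enumerate(stream):
        if limit is not None and i >= limit:
            break
        # Split from the right so the last `dim` fields are the components.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip malformed or empty lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


# Hypothetical two-word sample in the .vec layout, NOT real embeddings.
sample = io.StringIO("2 3\nle 0.1 0.2 0.3\nla 0.4 0.5 0.6\n")
dim, vectors = read_vec(sample)

if __name__ == "__main__":
    print(dim, sorted(vectors))
```

For a real file, the same function could be called on `open(path, encoding="utf-8")`; passing `limit` keeps memory use manageable, since entries in these files are typically ordered by decreasing word frequency.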
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
