Polyglot: Distributed Word Representations for Multilingual NLP

Rami Al-Rfou; Bryan Perozzi; Steven Skiena

arXiv:1307.1662 · cs.CL · June 30, 2014 · 308 cites

TL;DR

This paper introduces word embeddings trained on the Wikipedias of more than 100 languages, demonstrates their utility as the sole features for a part-of-speech tagger, and analyzes the semantic properties they capture.

Contribution

It presents a large-scale multilingual embedding resource and evaluates its utility in part-of-speech tagging and semantic analysis.

Findings

01. The embeddings perform competitively with near state-of-the-art methods in POS tagging for English, Danish, and Swedish.

02. Word groupings that lie close together in the embedding space reflect meaningful semantic relationships.

03. The authors plan to release the embeddings publicly to support multilingual NLP research.

Abstract

Distributed word representations (word embeddings) have recently contributed to competitive performance in language modeling and several NLP tasks. In this work, we train word embeddings for more than 100 languages using their corresponding Wikipedias. We quantitatively demonstrate the utility of our word embeddings by using them as the sole features for training a part of speech tagger for a subset of these languages. We find their performance to be competitive with near state-of-art methods in English, Danish and Swedish. Moreover, we investigate the semantic features captured by these embeddings through the proximity of word groupings. We will release these embeddings publicly to help researchers in the development and enhancement of multilingual applications.
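
The abstract's "proximity of word groupings" can be illustrated with a nearest-neighbor lookup. The following is a minimal sketch, not the authors' code: the vectors in `toy_embeddings` are invented for illustration and are not taken from the released embeddings, the helper names are hypothetical, and the use of Euclidean distance is an assumption.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# They are NOT taken from the released Polyglot embeddings.
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.2],
    "truck": [0.0, 0.8, 0.3],
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_neighbors(word, embeddings, k=2):
    """Return the k words whose vectors lie closest to the vector of `word`."""
    query = embeddings[word]
    # Pair each other word with its distance to the query, then sort by distance.
    scored = [
        (euclidean(query, vec), other)
        for other, vec in embeddings.items()
        if other != word
    ]
    scored.sort()
    return [other for _, other in scored[:k]]


if __name__ == "__main__":
    print(nearest_neighbors("cat", toy_embeddings, k=1))
```

With real embeddings, the dictionary would hold one vector per vocabulary word, and a word's neighbors would be inspected to judge whether they share its meaning or syntactic role.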

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification