Learning Word Embeddings for Low-resource Languages by PU Learning

Chao Jiang; Hsiang-Fu Yu; Cho-Jui Hsieh; Kai-Wei Chang

arXiv:1805.03366·cs.CL·May 10, 2018·1 cites

TL;DR

This paper introduces a PU-Learning method for learning word embeddings from very limited corpora in low-resource languages. It treats the zero entries of the co-occurrence matrix as informative rather than sampling only a few of them as negatives.

Contribution

It proposes a PU-Learning approach to co-occurrence matrix factorization, designed for learning word embeddings in low-resource languages.

Findings

01 Effective embeddings learned from small corpora.

02 PU-Learning outperforms traditional negative sampling methods.

03 Validated on four different low-resource languages.

Abstract

Word embedding is a key component in many downstream applications in natural language processing. Existing approaches often assume the existence of a large collection of text for learning effective word embeddings. However, such a corpus may not be available for some low-resource languages. In this paper, we study how to effectively learn a word embedding model on a corpus with only a few million tokens. In such a situation, the co-occurrence matrix is sparse, as the co-occurrences of many word pairs are unobserved. In contrast to existing approaches, which often sample only a few unobserved word pairs as negative samples, we argue that the zero entries in the co-occurrence matrix also provide valuable information. We then design a Positive-Unlabeled Learning (PU-Learning) approach to factorize the co-occurrence matrix and validate the proposed approach in four different languages.
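The idea of keeping zero entries in the factorization, rather than sampling a few of them, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `pu_factorize` and `weighted_loss`, the weight `rho` on unobserved entries, the regularizer `lam`, the toy matrix, and the use of plain gradient descent are all assumptions made for illustration. Observed (nonzero) entries get weight 1, and every unobserved (zero) entry gets a smaller weight `rho`, so it still pulls the reconstruction toward zero.

```python
import numpy as np

def pu_factorize(A, rho=0.1, rank=2, lam=0.01, lr=0.05, epochs=500, seed=0):
    """Weighted low-rank factorization of A as W @ H.T (illustrative only).

    Nonzero entries of A are weighted 1; zero entries are weighted rho
    instead of being ignored or subsampled.
    """
    rng = np.random.default_rng(seed)
    n, m = A.shape
    W = 0.1 * rng.standard_normal((n, rank))
    H = 0.1 * rng.standard_normal((m, rank))
    C = np.where(A > 0, 1.0, rho)  # per-entry confidence weights
    for _ in range(epochs):
        R = C * (W @ H.T - A)      # weighted residual over ALL entries
        grad_W = R @ H + lam * W
        grad_H = R.T @ W + lam * H
        W -= lr * grad_W
        H -= lr * grad_H
    return W, H

def weighted_loss(A, W, H, rho=0.1):
    """Weighted squared reconstruction error (regularizer excluded)."""
    C = np.where(A > 0, 1.0, rho)
    return float(np.sum(C * (A - W @ H.T) ** 2))

# Toy stand-in for a sparse co-occurrence-derived matrix (hypothetical values).
A = np.array([[0, 2, 1, 0],
              [2, 0, 0, 1],
              [1, 0, 0, 2],
              [0, 1, 2, 0]], dtype=float)

W, H = pu_factorize(A)
print(weighted_loss(A, W, H))
```

Setting `rho` to 0 in this sketch would ignore unobserved pairs entirely, while `rho` equal to 1 would trust every zero as much as an observed count; an intermediate value reflects that a zero in a small corpus is weak evidence, not proof, that two words are unrelated.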

Code & Models

Repositories

uclanlp/PU-Learning-for-Word-Embedding

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies