Learning Word Embeddings for Low-resource Languages by PU Learning
Chao Jiang, Hsiang-Fu Yu, Cho-Jui Hsieh, Kai-Wei Chang

TL;DR
This paper introduces a PU-Learning method to effectively learn word embeddings from very limited corpora in low-resource languages by leveraging zero co-occurrence entries as valuable information.
Contribution
It proposes a novel PU-Learning approach for co-occurrence matrix factorization tailored for low-resource language word embedding learning.
Findings
Effective embeddings learned from small corpora.
PU-Learning outperforms traditional negative sampling methods.
Validated on four different low-resource languages.
Abstract
Word embedding is a key component in many downstream applications in processing natural languages. Existing approaches often assume the existence of a large collection of text for learning effective word embeddings. However, such a corpus may not be available for some low-resource languages. In this paper, we study how to effectively learn a word embedding model on a corpus with only a few million tokens. In such a situation, the co-occurrence matrix is sparse, as the co-occurrences of many word pairs are unobserved. In contrast to existing approaches, which often sample only a few unobserved word pairs as negative samples, we argue that the zero entries in the co-occurrence matrix also provide valuable information. We then design a Positive-Unlabeled Learning (PU-Learning) approach to factorize the co-occurrence matrix and validate the proposed approach in four different languages.
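To make the idea in the abstract concrete, here is a minimal sketch of factorizing a co-occurrence matrix while keeping every zero entry in the loss at a reduced weight, instead of sampling a few of them as negatives. It is an illustration under stated assumptions, not the paper's implementation: the function names, the down-weighting parameter `rho`, the regularizer `lam`, the `log1p` target, and the plain SGD updates are all hypothetical choices made for this example.

```python
import math
import random


def cooccurrence_counts(sentences, window=2):
    """Count symmetric co-occurrences within a fixed window."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    counts = [[0.0] * n for _ in range(n)]
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    counts[index[w]][index[s[j]]] += 1.0
    return vocab, counts


def weighted_loss(counts, W, H, rho):
    """Weighted squared error: observed pairs weigh 1, zero entries weigh rho."""
    total = 0.0
    for i, row in enumerate(counts):
        for j, c in enumerate(row):
            weight = 1.0 if c > 0 else rho  # assumption: one global weight for zeros
            pred = sum(a * b for a, b in zip(W[i], H[j]))
            total += weight * (pred - math.log1p(c)) ** 2
    return total


def init_factors(n, dim, seed=0):
    """Small random word (W) and context (H) vectors."""
    rng = random.Random(seed)
    W = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n)]
    H = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n)]
    return W, H


def pu_factorize(counts, W, H, rho=0.1, lam=0.01, lr=0.05, epochs=300):
    """SGD over ALL entries, zeros included, with L2 shrinkage at each update."""
    n, dim = len(counts), len(W[0])
    for _ in range(epochs):
        for i in range(n):
            for j in range(n):
                c = counts[i][j]
                weight = 1.0 if c > 0 else rho
                err = sum(a * b for a, b in zip(W[i], H[j])) - math.log1p(c)
                for k in range(dim):
                    w_ik, h_jk = W[i][k], H[j][k]
                    W[i][k] -= lr * (weight * err * h_jk + lam * w_ik)
                    H[j][k] -= lr * (weight * err * w_ik + lam * h_jk)
    return W, H


# Toy usage on a hypothetical four-sentence corpus.
sentences = [["the", "cat", "sat"], ["the", "dog", "sat"],
             ["a", "cat", "ran"], ["a", "dog", "ran"]]
vocab, counts = cooccurrence_counts(sentences)
W, H = init_factors(len(vocab), dim=2)
loss_before = weighted_loss(counts, W, H, rho=0.1)
W, H = pu_factorize(counts, W, H, rho=0.1)
loss_after = weighted_loss(counts, W, H, rho=0.1)
```

Setting `rho` to 0 would ignore unobserved pairs entirely, while setting it to 1 would treat every zero as a confirmed negative; an intermediate value reflects the positive-unlabeled view that a zero count is weak evidence rather than proof of unrelatedness.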
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies
