Anchor-based Bilingual Word Embeddings for Low-Resource Languages

Tobias Eder; Viktor Hangya; Alexander Fraser

arXiv:2010.12627·cs.CL·July 28, 2021

TL;DR

This paper introduces an anchor-based method for building bilingual word embeddings that uses high-resource language vectors as anchors, improving both low-resource monolingual embeddings and bilingual lexicon induction.

Contribution

It proposes a novel approach using source language vectors as anchors to automatically align bilingual embeddings during training for low-resource languages.
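
The anchoring idea can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a skip-gram objective with negative sampling, a separate context matrix, and anchor rows that stay frozen during training. The function names (`init_target_embeddings`, `sgns_step`), the seed dictionary, and the two-dimensional vectors are all hypothetical, chosen only to show how copying source vectors into the target space ties the two spaces together.

```python
import numpy as np

def init_target_embeddings(target_vocab, source_vectors, seed_dict, dim, rng):
    """Randomly initialize target vectors, then overwrite anchor rows.

    A target word with a known translation (seed_dict) receives the source
    vector of that translation and is marked as frozen. Freezing is an
    assumption of this sketch.
    """
    emb = rng.normal(scale=0.1, size=(len(target_vocab), dim))
    frozen = np.zeros(len(target_vocab), dtype=bool)
    for i, word in enumerate(target_vocab):
        src = seed_dict.get(word)
        if src is not None and src in source_vectors:
            emb[i] = source_vectors[src]
            frozen[i] = True
    return emb, frozen

def sgns_step(emb, ctx, frozen, center, context, negatives, lr=0.05):
    """One skip-gram negative-sampling update that skips frozen anchor rows."""
    v = emb[center].copy()
    grad_v = np.zeros_like(v)
    for idx, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        score = 1.0 / (1.0 + np.exp(-np.dot(v, ctx[idx])))
        g = score - label              # gradient of the logistic loss
        grad_v += g * ctx[idx]         # accumulate before ctx[idx] changes
        ctx[idx] -= lr * g * v
    if not frozen[center]:
        emb[center] -= lr * grad_v

# Toy data (hypothetical): two anchors and one word without a translation.
rng = np.random.default_rng(0)
source_vectors = {"dog": np.array([1.0, 0.0]), "cat": np.array([0.0, 1.0])}
target_vocab = ["hund", "katze", "bellen"]
seed_dict = {"hund": "dog", "katze": "cat"}

emb, frozen = init_target_embeddings(target_vocab, source_vectors, seed_dict, 2, rng)
ctx = np.zeros((len(target_vocab), 2))
initial_bellen = emb[2].copy()

for _ in range(50):
    sgns_step(emb, ctx, frozen, center=2, context=0, negatives=[1])  # "bellen" near "hund"
    sgns_step(emb, ctx, frozen, center=0, context=2, negatives=[1])  # anchor as center
```

In this sketch only the non-anchor word "bellen" moves, while "hund" and "katze" keep the coordinates of "dog" and "cat", so whatever the target words learn is expressed in the source space from the start.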

Findings

01. Improved bilingual lexicon induction performance.

02. Enhanced monolingual word similarity in low-resource languages.

03. Effective alignment of bilingual embedding spaces.

Abstract

Good-quality monolingual word embeddings (MWEs) can be built for languages that have large amounts of unlabeled text. MWEs can be aligned to bilingual spaces using only a few thousand word translation pairs. For low-resource languages, training MWEs monolingually results in MWEs of poor quality, and thus poor bilingual word embeddings (BWEs) as well. This paper proposes a new approach for building BWEs in which the vector space of the high-resource source language is used as a starting point for training an embedding space for the low-resource target language. By using the source vectors as anchors, the vector spaces are automatically aligned during training. We experiment on English-German, English-Hiligaynon and English-Macedonian. We show that our approach results not only in improved BWEs and bilingual lexicon induction performance, but also in improved target language MWE quality as…
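
For context, the abstract's remark that MWEs can be aligned with a few thousand translation pairs refers to mapping-based alignment. One widely used form of this is orthogonal Procrustes; the text above does not say which mapping method the paper compares against, so the sketch below is only an illustration of the general idea. The helper names (`procrustes`, `induce_translation`) and the toy vectors are hypothetical.

```python
import numpy as np

def procrustes(X, Y):
    """Return the orthogonal matrix W minimizing ||X @ W - Y||_F.

    X and Y have shape (n_pairs, dim); row i of X is a source word vector
    and row i of Y is the vector of its translation.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def induce_translation(query_vec, W, target_matrix, target_words):
    """Map a source vector and return the cosine-nearest target word."""
    mapped = query_vec @ W
    mapped = mapped / np.linalg.norm(mapped)
    normed = target_matrix / np.linalg.norm(target_matrix, axis=1, keepdims=True)
    return target_words[int(np.argmax(normed @ mapped))]

# Toy data (hypothetical): the target space is the source space rotated by 90 degrees.
R = np.array([[0.0, -1.0], [1.0, 0.0]])
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = X @ R
target_words = ["hund", "katze", "vogel"]

W = procrustes(X, Y)
best = induce_translation(X[0], W, Y, target_words)
```

Here `W` recovers the rotation and the first source vector maps onto "hund". With poor low-resource MWEs the two spaces are not approximately rotations of each other, which is the failure case the abstract describes.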
