Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora

Takashi Wada; Tomoharu Iwata; Yuji Matsumoto; Timothy Baldwin; Jey Han Lau

arXiv:2010.14649·cs.CL·October 22, 2021

TL;DR

This paper introduces a novel method for learning contextualised cross-lingual word embeddings using a small parallel corpus, leveraging an encoder-decoder model that improves performance in extremely low-resource languages and achieves state-of-the-art results in high-resource settings.

Contribution

The paper presents an encoder-decoder approach that trains cross-lingual embeddings jointly through shared parameters and combines word and subword information. The approach remains effective even with minimal data.
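One way to picture "combining word and subword information" is to add a word-level vector to the mean of its character n-gram vectors, so that orthographically similar words in different languages share part of their representation. The sketch below is only an illustration under that assumption: the class `WordSubwordEmbedder`, the helper `char_ngrams`, the n-gram range, and the sum-plus-mean combination are hypothetical choices, not the paper's stated formulation, and the vectors are random rather than trained.

```python
import random


def char_ngrams(word, n_min=2, n_max=3):
    """Return character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


class WordSubwordEmbedder:
    """Toy embedder (hypothetical): word vector + mean of n-gram vectors."""

    def __init__(self, dim=8, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.word_table = {}   # word -> vector
        self.ngram_table = {}  # n-gram -> vector, shared by all languages

    def _lookup(self, table, key):
        # Lazily create a small random vector the first time a key is seen.
        if key not in table:
            table[key] = [self.rng.uniform(-0.1, 0.1) for _ in range(self.dim)]
        return table[key]

    def embed(self, word):
        word_vec = self._lookup(self.word_table, word)
        gram_vecs = [self._lookup(self.ngram_table, g) for g in char_ngrams(word)]
        # Average the n-gram vectors dimension by dimension.
        mean_gram = [sum(col) / len(gram_vecs) for col in zip(*gram_vecs)]
        return [w + g for w, g in zip(word_vec, mean_gram)]


embedder = WordSubwordEmbedder(dim=8)
vec = embedder.embed("nation")
shared = set(char_ngrams("nation")) & set(char_ngrams("nacion"))
```

Here `shared` holds the n-grams common to the two spellings (for example `<n` and `na`); because the n-gram table is shared, those entries contribute the same vectors to both words.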

Findings

01 Outperforms existing methods in low-resource language tasks

02 Achieves state-of-the-art on German-English word alignment

03 Effective use of shared parameters and subword information
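To make the word alignment task concrete: once source and target tokens have contextualised vectors in a shared space, a common generic baseline links each source token to the target token with the highest cosine similarity. The sketch below shows that baseline only; the functions `cosine` and `align` and the toy vectors are illustrative assumptions, and the paper's own alignment procedure may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(src_vecs, tgt_vecs):
    """For each source position, return the index of the most similar target token."""
    return [max(range(len(tgt_vecs)), key=lambda j: cosine(s, tgt_vecs[j]))
            for s in src_vecs]


# Toy 2-dimensional "contextualised" vectors (made up for illustration).
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[0.1, 0.9], [0.9, 0.1], [0.5, 0.5]]
links = align(src, tgt)
print(links)  # [1, 0]
```

Each entry of `links` is a target index, so `[1, 0]` means source token 0 aligns to target token 1 and source token 1 aligns to target token 0.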

Abstract

We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus (e.g. a few hundred sentence pairs). Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence. Through sharing model parameters among different languages, our model jointly trains the word embeddings in a common cross-lingual space. We also propose to combine word and subword embeddings to make use of orthographic similarities across different languages. We base our experiments on real-world data from endangered languages, namely Yongning Na, Shipibo-Konibo, and Griko. Our experiments on bilingual lexicon induction and word alignment tasks show that our model outperforms existing methods by a large margin for most language pairs. These results demonstrate that, contrary to common belief, an…
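The phrase "simultaneously translates and reconstructs an input sentence" suggests a joint training objective. As a rough sketch only, with notation that is assumed here rather than taken from the paper (a parallel corpus $D$ of sentence pairs $(x, y)$, shared parameters $\theta$, and equal weighting of the two terms), such an objective could be written as:

```latex
\mathcal{L}(\theta)
  = -\sum_{(x,\,y) \in D}
    \Bigl[
      \underbrace{\log p_{\theta}(y \mid x)}_{\text{translation}}
      +
      \underbrace{\log p_{\theta}(x \mid x)}_{\text{reconstruction}}
    \Bigr]
```

Because $\theta$ is shared across languages, both terms update the same parameters, which is one way the embeddings of different languages could end up in a common space.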

Code & Models

Repositories

twadada/multilingual-nlm
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory