Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora
Takashi Wada, Tomoharu Iwata, Yuji Matsumoto, Timothy Baldwin, Jey Han Lau

TL;DR
This paper introduces a method for learning contextualised cross-lingual word embeddings from a small parallel corpus. It uses an LSTM encoder-decoder model that improves performance on extremely low-resource languages and also achieves state-of-the-art results on German-English word alignment, a high-resource setting.
Contribution
The paper presents an encoder-decoder approach that jointly trains cross-lingual embeddings with parameters shared across languages and combines word and subword information. The approach remains effective even with minimal data.
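One way to picture how subword information can help is the sketch below. It is a minimal illustration, not the paper's model: the character n-gram scheme, the `<`/`>` boundary markers, the "word vector plus mean of n-gram vectors" rule, and the names `char_ngrams` and `combine` are all assumptions made for the example, and the toy vectors are hand-written rather than learned.

```python
from typing import Dict, List

Vector = List[float]


def char_ngrams(word: str, n_min: int = 2, n_max: int = 3) -> List[str]:
    """Return character n-grams of a word wrapped in boundary markers."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]


def combine(word: str,
            word_emb: Dict[str, Vector],
            subword_emb: Dict[str, Vector],
            dim: int) -> Vector:
    """Assumed combination rule: word vector + mean of known n-gram vectors.

    Out-of-vocabulary words fall back to a zero word vector, so they are
    represented by their subwords alone.
    """
    vec = list(word_emb.get(word, [0.0] * dim))
    grams = [g for g in char_ngrams(word) if g in subword_emb]
    if grams:
        for d in range(dim):
            vec[d] += sum(subword_emb[g][d] for g in grams) / len(grams)
    return vec


# Toy lookup tables (hypothetical values, not trained embeddings).
word_emb = {"nation": [1.0, 0.0]}
subword_emb = {"<n": [0.0, 2.0], "on>": [2.0, 0.0]}

seen = combine("nation", word_emb, subword_emb, dim=2)
unseen = combine("nacion", word_emb, subword_emb, dim=2)
print(seen, unseen)
```

Here the unseen word "nacion" still receives a non-zero vector because it shares the n-grams `<n` and `on>` with "nation", which is the kind of orthographic overlap between languages that subword embeddings are meant to exploit.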
Findings
Outperforms existing methods in low-resource language tasks
Achieves state-of-the-art on German-English word alignment
Effective use of shared parameters and subword information
Abstract
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus (e.g. a few hundred sentence pairs). Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence. Through sharing model parameters among different languages, our model jointly trains the word embeddings in a common cross-lingual space. We also propose to combine word and subword embeddings to make use of orthographic similarities across different languages. We base our experiments on real-world data from endangered languages, namely Yongning Na, Shipibo-Konibo, and Griko. Our experiments on bilingual lexicon induction and word alignment tasks show that our model outperforms existing methods by a large margin for most language pairs. These results demonstrate that, contrary to common belief, an…
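As a rough illustration of the bilingual lexicon induction task mentioned in the abstract, the sketch below retrieves, for each source word, the target word with the highest cosine similarity in a shared embedding space. This is a common baseline formulation and only an assumption here: the abstract does not state the retrieval criterion the authors use, the vectors are static toy values rather than contextualised embeddings from an LSTM encoder-decoder, and `cosine` and `induce_lexicon` are hypothetical helper names.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def induce_lexicon(src_emb: Dict[str, Vector],
                   tgt_emb: Dict[str, Vector]) -> Dict[str, str]:
    """Map each source word to its nearest target word by cosine similarity."""
    return {
        src_word: max(tgt_emb, key=lambda t: cosine(src_vec, tgt_emb[t]))
        for src_word, src_vec in src_emb.items()
    }


# Toy vectors assumed to already live in one cross-lingual space.
german = {"Hund": [0.9, 0.1], "Katze": [0.1, 0.9]}
english = {"dog": [1.0, 0.0], "cat": [0.0, 1.0]}

lexicon = induce_lexicon(german, english)
print(lexicon)
```

The retrieval step only makes sense because both vocabularies are embedded in a common space, which is what the parameter sharing described above is intended to provide.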
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory
