When Word Embeddings Become Endangered

Khalid Alnajjar

arXiv:2103.13275 · cs.CL · March 25, 2021

TL;DR

This paper introduces a method for building cross-lingual word embeddings for endangered languages from the existing embeddings of resource-rich languages and translation dictionaries, which makes sentiment analysis feasible even where training data is scarce.

Contribution

The author develops an approach for constructing word embeddings for endangered languages from translation dictionaries, fine-tuning them on universal dependencies sentences, and aligning them with the embedding spaces of resource-rich languages. A universal sentiment analysis model is then built with the result.
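To make the dictionary-based construction step concrete, here is a minimal sketch in Python. It assumes one simple strategy that the summary does not confirm: each endangered-language word receives the average of the vectors of its dictionary translations in a resource-rich language. The function name `build_embeddings` and the toy data are hypothetical and are not taken from the paper or its library.

```python
import numpy as np


def build_embeddings(dictionary, rich_vectors):
    """Assign each source word the mean vector of its known translations.

    dictionary: maps a source-language word to a list of translations.
    rich_vectors: maps a resource-rich-language word to a 1-D numpy array.
    Words with no translation in rich_vectors are skipped.
    """
    embeddings = {}
    for word, translations in dictionary.items():
        known = [rich_vectors[t] for t in translations if t in rich_vectors]
        if not known:
            continue  # no usable translation, so no vector can be built
        embeddings[word] = np.mean(known, axis=0)
    return embeddings


# Toy 2-dimensional vectors standing in for pretrained embeddings (assumption).
toy_vectors = {
    "dog": np.array([1.0, 0.0]),
    "hound": np.array([0.0, 1.0]),
}

# Placeholder source words; a real translation dictionary would go here.
toy_dictionary = {
    "source_word_a": ["dog", "hound"],
    "source_word_b": ["not_in_vocabulary"],
}

toy_embeddings = build_embeddings(toy_dictionary, toy_vectors)
```

Averaging is only one plausible choice. Weighting translations by frequency or keeping only the first translation would be equally easy to swap in.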

Findings

01. Embeddings for endangered languages are well aligned with those of resource-rich languages.

02. The sentiment analysis model achieved high accuracy across multiple languages.

03. All resources and models are openly available via a Python library.

Abstract

Big languages such as English and Finnish have many natural language processing (NLP) resources and models, but this is not the case for low-resourced and endangered languages as such resources are so scarce despite the great advantages they would provide for the language communities. The most common types of resources available for low-resourced and endangered languages are translation dictionaries and universal dependencies. In this paper, we present a method for constructing word embeddings for endangered languages using existing word embeddings of different resource-rich languages and the translation dictionaries of resource-poor languages. Thereafter, the embeddings are fine-tuned using the sentences in the universal dependencies and aligned to match the semantic spaces of the big languages, resulting in cross-lingual embeddings. The endangered languages we work with here are…
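The abstract says the embeddings are aligned to the semantic spaces of the big languages but, as excerpted here, does not name the alignment technique. The sketch below therefore uses orthogonal Procrustes, a common choice for mapping one embedding space onto another, purely as an illustrative assumption. The function name `procrustes_align` and the synthetic data are hypothetical.

```python
import numpy as np


def procrustes_align(source, target):
    """Return the orthogonal matrix W minimising ||source @ W - target||_F.

    source, target: arrays of shape (n_pairs, dim), where row i of each
    matrix holds the vectors of one translation pair.
    """
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


# Synthetic check (assumption): the target space is an exact rotation of
# the source space, so the recovered mapping should match that rotation.
rng = np.random.default_rng(0)
true_rotation, _ = np.linalg.qr(rng.normal(size=(4, 4)))
source_vectors = rng.normal(size=(20, 4))
target_vectors = source_vectors @ true_rotation

mapping = procrustes_align(source_vectors, target_vectors)
aligned_vectors = source_vectors @ mapping
```

Restricting the mapping to an orthogonal matrix preserves distances and angles within the source space, which is why this family of methods is often preferred over an unconstrained linear map.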
