
TL;DR
This paper introduces a method to create cross-lingual word embeddings for endangered languages using resources from resource-rich languages, enabling effective sentiment analysis despite scarce data.
Contribution
The authors develop a novel approach for constructing and aligning word embeddings for endangered languages using translation dictionaries and universal dependencies, and build a universal sentiment analysis model.
Findings
Embeddings for endangered languages are well-aligned with resource-rich languages.
The sentiment analysis model achieved high accuracy across multiple languages.
All resources and models are openly available via a Python library.
Abstract
Big languages such as English and Finnish have many natural language processing (NLP) resources and models, but this is not the case for low-resourced and endangered languages as such resources are so scarce despite the great advantages they would provide for the language communities. The most common types of resources available for low-resourced and endangered languages are translation dictionaries and universal dependencies. In this paper, we present a method for constructing word embeddings for endangered languages using existing word embeddings of different resource-rich languages and the translation dictionaries of resource-poor languages. Thereafter, the embeddings are fine-tuned using the sentences in the universal dependencies and aligned to match the semantic spaces of the big languages, resulting in cross-lingual embeddings. The endangered languages we work with here are…
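The abstract describes the pipeline only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes that an endangered-language word can be initialized as the average of the vectors of its dictionary translations, and it uses orthogonal Procrustes as a stand-in for the alignment step; both choices are assumptions, as are all function names and the toy data. The fine-tuning step on universal dependencies sentences is not shown.

```python
import numpy as np


def build_embeddings(dictionary, rich_vectors):
    """Initialize each word as the mean of its translations' vectors.

    Averaging is an assumed strategy, not taken from the paper.
    Words with no translation covered by rich_vectors are skipped.
    """
    embeddings = {}
    for word, translations in dictionary.items():
        known = [rich_vectors[t] for t in translations if t in rich_vectors]
        if known:
            embeddings[word] = np.mean(known, axis=0)
    return embeddings


def procrustes_align(source, target):
    """Return the orthogonal matrix W minimizing ||source @ W - target||_F.

    Orthogonal Procrustes is a common alignment technique, used here
    as a stand-in for whatever alignment the paper applies.
    """
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


# Toy resource-rich vectors (hypothetical 2-dimensional values).
rich_vectors = {
    "water": np.array([1.0, 0.0]),
    "lake": np.array([0.0, 1.0]),
    "fire": np.array([1.0, 1.0]),
}

# Hypothetical translation dictionary: placeholder word -> translations.
dictionary = {
    "word_a": ["water", "lake"],
    "word_b": ["fire"],
    "word_c": ["unknown"],  # no covered translation, so it is skipped
}

embeddings = build_embeddings(dictionary, rich_vectors)

# Alignment demo: recover a known rotation between two toy spaces.
theta = np.pi / 6
rotation = np.array([
    [np.cos(theta), -np.sin(theta)],
    [np.sin(theta), np.cos(theta)],
])
source = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
target = source @ rotation
w = procrustes_align(source, target)
```

In this sketch, `source @ w` maps the toy source space onto the target space. In a real setting, the rows of `source` and `target` would be vectors of word pairs known to be translations of each other.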
