Finetuning a Kalaallisut-English machine translation system using web-crawled data
Alex Jones

TL;DR
This paper explores finetuning a neural machine translation system for Kalaallisut-English using web-crawled data, highlighting challenges and potential for future improvements in low-resource language translation.
Contribution
The paper introduces a method for mining pseudoparallel data from web sources to improve Kalaallisut-English translation models, and discusses future directions for low-resource language MT.
Findings
The dataset was too small and noisy to significantly improve the model.
Web-crawled pseudoparallel data has potential but requires more resources.
Openly shared code and data facilitate future research.
Abstract
West Greenlandic, known by native speakers as Kalaallisut, is an extremely low-resource polysynthetic language spoken by around 56,000 people in Greenland. Here, we attempt to finetune a pretrained Kalaallisut-to-English neural machine translation (NMT) system using web-crawled pseudoparallel sentences from around 30 multilingual websites. We compile a corpus of over 93,000 Kalaallisut sentences and over 140,000 Danish sentences, then use cross-lingual sentence embeddings and approximate nearest-neighbors search in an attempt to mine near-translations from these corpora. Finally, we translate the Danish sentence to English to obtain a synthetic Kalaallisut-English aligned corpus. Although the resulting dataset is too small and noisy to improve the pretrained MT model, we believe that with additional resources, we could construct a better pseudoparallel corpus and achieve more promising…
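The mining step described in the abstract (embed sentences from both corpora in a shared space, then look up nearest neighbors across languages) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the author's implementation: the function name `mine_pairs`, the mutual-nearest-neighbor filter, the 0.8 similarity threshold, and the toy 3-dimensional vectors standing in for real cross-lingual sentence embeddings. For brevity it uses exact cosine search in NumPy rather than the approximate nearest-neighbors search the abstract mentions.

```python
import numpy as np


def mine_pairs(src_emb, tgt_emb, threshold=0.8):
    """Return (src_idx, tgt_idx, score) for mutual nearest neighbours.

    Hypothetical helper: exact cosine search stands in for the
    approximate nearest-neighbours search named in the abstract.
    """
    # L2-normalise rows so a dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T  # shape: (n_src, n_tgt)

    best_tgt = sim.argmax(axis=1)  # best target for each source sentence
    best_src = sim.argmax(axis=0)  # best source for each target sentence

    pairs = []
    for i, j in enumerate(best_tgt):
        score = float(sim[i, j])
        # Keep a pair only if each side picks the other (mutual match)
        # and the similarity clears the (assumed) threshold.
        if best_src[j] == i and score >= threshold:
            pairs.append((i, int(j), score))
    return pairs


# Toy vectors standing in for Kalaallisut and Danish sentence embeddings.
kl_emb = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
da_emb = np.array([[0.0, 0.9, 0.1],
                   [0.95, 0.05, 0.0],
                   [0.4, 0.4, 0.6]])

pairs = mine_pairs(kl_emb, da_emb)
for kl_idx, da_idx, score in pairs:
    print(f"Kalaallisut #{kl_idx} <-> Danish #{da_idx} (cosine {score:.3f})")
```

In this toy run the third Kalaallisut vector does find a mutual match, but its similarity falls below the threshold and the pair is dropped. That illustrates the precision/recall trade-off a threshold imposes when the candidate pool is small and noisy. In a full pipeline, the Danish side of each surviving pair would then be machine-translated into English to form the synthetic Kalaallisut-English corpus.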
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Web Data Mining and Analysis
