Finetuning a Kalaallisut-English machine translation system using web-crawled data

Alex Jones

arXiv:2206.02230·cs.CL·June 7, 2022

TL;DR

This paper attempts to finetune a pretrained Kalaallisut-to-English neural machine translation system on pseudoparallel sentences mined from web-crawled data. The mined corpus proved too small and noisy to improve the model, but the author expects better results with additional resources.

Contribution

It introduces a method for mining pseudoparallel data from web sources to improve Kalaallisut-English translation models, and discusses future directions for low-resource language MT.

Findings

01 The mined dataset was too small and noisy to improve the pretrained model.

02 Web-crawled pseudoparallel data shows potential, but building a better corpus requires additional resources.

03 Openly shared code and data facilitate future research.

Abstract

West Greenlandic, known by native speakers as Kalaallisut, is an extremely low-resource polysynthetic language spoken by around 56,000 people in Greenland. Here, we attempt to finetune a pretrained Kalaallisut-to-English neural machine translation (NMT) system using web-crawled pseudoparallel sentences from around 30 multilingual websites. We compile a corpus of over 93,000 Kalaallisut sentences and over 140,000 Danish sentences, then use cross-lingual sentence embeddings and approximate nearest-neighbors search in an attempt to mine near-translations from these corpora. Finally, we translate the Danish sentences to English to obtain a synthetic Kalaallisut-English aligned corpus. Although the resulting dataset is too small and noisy to improve the pretrained MT model, we believe that with additional resources, we could construct a better pseudoparallel corpus and achieve more promising…
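
The mining step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes sentence embeddings have already been computed by some cross-lingual encoder (the toy vectors below are made up), it uses exact cosine nearest-neighbor search in place of the approximate search the abstract mentions, and the function names and the threshold value are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


def mine_pairs(src_vecs, tgt_vecs, threshold):
    """For each source embedding, find the most similar target embedding.

    Returns (source_index, target_index, score) triples whose cosine
    similarity is at least `threshold`. This is exact search; a real
    pipeline over ~100k sentences would likely use an approximate index.
    """
    src = [normalize(v) for v in src_vecs]
    tgt = [normalize(v) for v in tgt_vecs]
    pairs = []
    for i, s in enumerate(src):
        scores = [cosine(s, t) for t in tgt]
        j = max(range(len(tgt)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


# Toy stand-ins for Kalaallisut and Danish sentence embeddings (assumed values).
kl_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
da_emb = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.98], [0.5, 0.5, 0.5], [0.1, 0.95, 0.0]]

for i, j, score in mine_pairs(kl_emb, da_emb, threshold=0.99):
    print(f"Kalaallisut #{i} <-> Danish #{j} (cosine {score:.3f})")
```

The threshold controls the trade-off the abstract points to: a loose threshold keeps more pairs but admits noisy near-translations, while a strict one yields a cleaner but smaller corpus. The Danish side of each retained pair would then be machine-translated to English to form the synthetic Kalaallisut-English corpus.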

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Web Data Mining and Analysis