Meeting the Needs of Low-Resource Languages: The Value of Automatic Alignments via Pretrained Models

Abteen Ebrahimi; Arya D. McCarthy; Arturo Oncevay; Luis Chiruzzo; John E. Ortega; Gustavo A. Giménez-Lugo; Rolando Coto-Solano; Katharina Kann

arXiv:2302.07912·cs.CL·February 17, 2023

TL;DR

This paper evaluates how well modern multilingual aligners perform on low-resource languages, providing new gold-standard datasets and assessing their effectiveness both directly and in downstream NLP tasks.

Contribution

It introduces gold-standard alignments for four low-resource language pairs (Bribri, Guarani, Quechua, and Shipibo-Konibo, each paired with Spanish) and compares state-of-the-art aligners, with and without model adaptation, against traditional methods.

Findings

01 Transformer-based aligners generally outperform traditional models.

02 Traditional and modern aligners nonetheless remain competitive with each other.

03 Model adaptation improves alignment quality for unseen languages.

Abstract

Large multilingual models have inspired a new class of word alignment methods, which work well for the model's pretraining languages. However, the languages most in need of automatic alignment are low-resource and, thus, not typically included in the pretraining data. In this work, we ask: How do modern aligners perform on unseen languages, and are they better than traditional methods? We contribute gold-standard alignments for Bribri–Spanish, Guarani–Spanish, Quechua–Spanish, and Shipibo-Konibo–Spanish. With these, we evaluate state-of-the-art aligners with and without model adaptation to the target language. Finally, we also evaluate the resulting alignments extrinsically through two downstream tasks: named entity recognition and part-of-speech tagging. We find that although transformer-based methods generally outperform traditional models, the two classes of approach remain…
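As an illustration of what evaluating an aligner against gold-standard alignments can look like, here is a minimal sketch that scores predicted word-alignment links by precision, recall, and F1. The excerpt above does not state which metric the paper reports, so the choice of metric is an assumption, as is the `i-j` link format (source index, target index) and the helper names `parse_links` and `alignment_prf`, which are hypothetical and not taken from the paper or its repository.

```python
def parse_links(line):
    """Parse space-separated 'i-j' links into a set of (source, target) index pairs.

    The 'i-j' format is an assumption made for this sketch.
    """
    return {tuple(int(x) for x in pair.split("-")) for pair in line.split()}


def alignment_prf(predicted, gold):
    """Return (precision, recall, F1) of predicted links against gold links."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: three gold links, three predicted links, two in common.
gold = parse_links("0-0 1-1 2-3")
pred = parse_links("0-0 1-2 2-3")
precision, recall, f1 = alignment_prf(pred, gold)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Treating links as a set makes the score independent of link order; a real evaluation would aggregate counts over all sentence pairs rather than scoring a single pair.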

Code & Models

Repositories

abteen/alignment (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems