Meeting the Needs of Low-Resource Languages: The Value of Automatic Alignments via Pretrained Models
Abteen Ebrahimi, Arya D. McCarthy, Arturo Oncevay, Luis Chiruzzo, John E. Ortega, Gustavo A. Giménez-Lugo, Rolando Coto-Solano, Katharina Kann

TL;DR
This paper evaluates how well modern multilingual aligners perform on low-resource languages, providing new gold-standard datasets and assessing their effectiveness both directly and in downstream NLP tasks.
Contribution
It introduces gold-standard alignments for several low-resource language pairs and compares the performance of state-of-the-art aligners with traditional methods, including adaptation techniques.
Findings
Transformer-based aligners generally outperform traditional models.
Even so, traditional and modern aligners remain competitive with each other.
Model adaptation improves alignment quality for unseen languages.
Abstract
Large multilingual models have inspired a new class of word alignment methods, which work well for the model's pretraining languages. However, the languages most in need of automatic alignment are low-resource and, thus, not typically included in the pretraining data. In this work, we ask: How do modern aligners perform on unseen languages, and are they better than traditional methods? We contribute gold-standard alignments for Bribri–Spanish, Guarani–Spanish, Quechua–Spanish, and Shipibo-Konibo–Spanish. With these, we evaluate state-of-the-art aligners with and without model adaptation to the target language. Finally, we also evaluate the resulting alignments extrinsically through two downstream tasks: named entity recognition and part-of-speech tagging. We find that although transformer-based methods generally outperform traditional models, the two classes of approach remain…
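As an illustration of what evaluating an aligner against gold-standard alignments can look like, the sketch below scores predicted word-alignment links against gold links with precision, recall, and F1. It is a minimal sketch under stated assumptions, not the paper's evaluation code: the abstract above does not say which metric the authors report, the whitespace-separated "i-j" link format is assumed for convenience, and the names parse_links and score_alignment as well as the example links are hypothetical.

```python
from typing import Set, Tuple

# A link pairs a source token index with a target token index.
Link = Tuple[int, int]


def parse_links(line: str) -> Set[Link]:
    """Parse whitespace-separated 'i-j' pairs, e.g. '0-0 1-2' (assumed format)."""
    return {(int(i), int(j)) for i, j in (pair.split("-") for pair in line.split())}


def score_alignment(predicted: Set[Link], gold: Set[Link]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of predicted links against gold links."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical links for one sentence pair (not taken from the paper's data).
gold = parse_links("0-0 1-1 2-3 3-2")
pred = parse_links("0-0 1-1 2-2 3-2")
p, r, f = score_alignment(pred, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Here three of the four predicted links match the gold links, so precision, recall, and F1 are all 0.75. Scores like these are computed per sentence pair or pooled over a test set, which is the kind of intrinsic comparison that gold alignments make possible.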
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
