Dialectal and Low-Resource Machine Translation for Aromanian

Alexandru-Iulius Jerpelea; Alina Rădoi; Sergiu Nisioi

arXiv:2410.17728·cs.CL·January 8, 2025

TL;DR

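This paper builds a neural machine translation system for Aromanian, an endangered Eastern Romance language. It introduces the largest Aromanian-Romanian parallel corpus to date (79,000 sentence pairs) and compares several translation models, supporting both language preservation and computational linguistics.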

Contribution

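The paper introduces the largest Aromanian-Romanian parallel corpus to date, compares several translation models optimized for Aromanian, and provides auxiliary tools: a language-agnostic sentence embedding model and a diacritics conversion system.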

Findings

01

Created an Aromanian-Romanian parallel corpus of 79,000 sentence pairs

02

Compared several translation models optimized for Aromanian

03

Provided publicly available datasets and tools for language preservation

Abstract

This paper presents the process of building a neural machine translation system with support for English, Romanian, and Aromanian - an endangered Eastern Romance language. The primary contribution of this research is twofold: (1) the creation of the most extensive Aromanian-Romanian parallel corpus to date, consisting of 79,000 sentence pairs, and (2) the development and comparative analysis of several machine translation models optimized for Aromanian. To accomplish this, we introduce a suite of auxiliary tools, including a language-agnostic sentence embedding model for text mining and automated evaluation, complemented by a diacritics conversion system for different writing standards. This research brings contributions to both computational linguistics and language preservation efforts by establishing essential resources for a historically under-resourced language. All datasets,…
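The abstract mentions using a sentence embedding model for text mining. The listing does not describe the mining procedure, so the following is only a generic sketch of one common approach: pair each source sentence with its nearest target sentence by cosine similarity and keep pairs above a threshold. The function names, the threshold value, and the toy vectors are all assumptions for illustration; in practice the vectors would come from an embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """Return (src_index, tgt_index, score) for each source sentence whose
    best-matching target sentence scores at or above the threshold.
    The threshold of 0.8 is an illustrative assumption, not a value
    taken from the paper."""
    pairs = []
    for i, u in enumerate(src_vecs):
        best_j, best_score = -1, -1.0
        for j, v in enumerate(tgt_vecs):
            score = cosine(u, v)
            if score > best_score:
                best_j, best_score = j, score
        if best_score >= threshold:
            pairs.append((i, best_j, best_score))
    return pairs

# Toy, hand-made embeddings standing in for real model output (hypothetical).
SRC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
TGT = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]

if __name__ == "__main__":
    for i, j, score in mine_pairs(SRC, TGT):
        print(f"source {i} <-> target {j} (cosine {score:.3f})")
```

With these toy vectors, the third source sentence has no sufficiently close target and is dropped, which is the filtering behavior a threshold is meant to provide. A brute-force nearest-neighbor search like this is quadratic in corpus size; larger corpora would typically need an approximate index.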

Code & Models

Datasets

aronlp/aromanian-romanian-MT-corpus
dataset · 3 downloads

Taxonomy

Topics: Natural Language Processing Techniques · Linguistics, Language Diversity, and Identity