Expanding FLORES+ Benchmark for more Low-Resource Settings: Portuguese-Emakhuwa Machine Translation Evaluation

Felermino D. M. Antonio Ali; Henrique Lopes Cardoso; Rui Sousa-Silva

arXiv:2408.11457·cs.CL·August 22, 2024

TL;DR

This paper extends the FLORES+ benchmark to Emakhuwa, a low-resource language widely spoken in Mozambique, by translating the dev and devtest sets from Portuguese and evaluating baseline machine translation models. The baselines underperform, and spelling inconsistencies remain a challenge.

Contribution

The paper introduces a Portuguese-Emakhuwa evaluation set with multiple reference sentences per source and provides baseline translation results, giving future work on this language a common point of comparison.

Findings

01. Baseline models underperform on Emakhuwa translation tasks.

02. Spelling inconsistencies significantly affect translation quality.

03. Further research is needed to improve machine translation for Emakhuwa.

Abstract

As part of the Open Language Data Initiative shared tasks, we have expanded the FLORES+ evaluation set to include Emakhuwa, a low-resource language widely spoken in Mozambique. We translated the dev and devtest sets from Portuguese into Emakhuwa, and we detail the translation process and quality assurance measures used. Our methodology involved various quality checks, including post-editing and adequacy assessments. The resulting datasets consist of multiple reference sentences for each source. We present baseline results from training a Neural Machine Translation system and fine-tuning existing multilingual translation models. Our findings suggest that spelling inconsistencies remain a challenge in Emakhuwa. Additionally, the baseline models underperformed on this evaluation set, underscoring the necessity for further research to enhance machine translation quality for Emakhuwa. The…
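Because the abstract describes multiple reference sentences per source and flags spelling inconsistencies, it may help to see how a multi-reference, character-level score behaves. The following is a minimal sketch, not the authors' evaluation setup: the metric is a simplified chrF-style F-score written from scratch, the function names (`char_ngrams`, `ngram_f_score`, `multi_ref_score`) are hypothetical, and the example strings are English placeholders rather than Emakhuwa data.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after collapsing runs of whitespace."""
    s = " ".join(text.split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def ngram_f_score(hyp: str, ref: str, max_n: int = 3, beta: float = 2.0) -> float:
    """Simplified chrF-style score (an assumption, not the paper's metric):
    average character n-gram precision and recall, combined as F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:
            continue  # string too short to contain any n-gram of this order
        overlap = sum((h & r).values())  # clipped n-gram matches
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p == 0.0 and r == 0.0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


def multi_ref_score(hyp: str, refs: list[str]) -> float:
    """Score a hypothesis against several references and keep the best match."""
    return max(ngram_f_score(hyp, ref) for ref in refs)


# Placeholder strings (English, purely illustrative): two references that
# differ only in spelling, and a hypothesis that matches the second one.
references = ["the colour of the house", "the color of the house"]
hypothesis = "the color of the house"

single = ngram_f_score(hypothesis, references[0])
multi = multi_ref_score(hypothesis, references)
print(f"single-reference: {single:.3f}, multi-reference: {multi:.3f}")
```

With a single reference, a hypothesis that uses a different but valid spelling is penalised; taking the best score over several references removes that penalty when one of them uses the same spelling. Character-level matching also gives partial credit for near-identical spellings, which word-level matching would not.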

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Sparse Evolutionary Training