AFRIDOC-MT: Document-level MT Corpus for African Languages
Jesujoba O. Alabi, Israel Abebe Azime, Miaoran Zhang, Cristina España-Bonet, Rachel Bawden, Dawei Zhu, David Ifeoluwa Adelani, Clement Oyeleke Odoje, Idris Akinade, Iffat Maab, Davis David, Shamsuddeen Hassan Muhammad, Neo Putini, David O. Ademuyiwa, Andrew Caines

TL;DR
This paper presents AFRIDOC-MT, a new document-level translation dataset for African languages, and evaluates neural machine translation models and large language models on it, revealing their strengths and limitations in translating these languages at the document level.
Contribution
Introduces AFRIDOC-MT, a comprehensive African language translation dataset, and benchmarks NMT and LLM performance on document translation tasks involving African languages.
Findings
NLLB-200 achieved the best average NMT performance
GPT-4o outperformed general-purpose LLMs in translation quality
Models fine-tuned on documents improved performance, but sentence-trained models struggled with longer texts
Abstract
This paper introduces AFRIDOC-MT, a document-level multi-parallel translation dataset covering English and five African languages: Amharic, Hausa, Swahili, Yorùbá, and Zulu. The dataset comprises 334 health and 271 information technology news documents, all human-translated from English to these languages. We conduct document-level translation benchmark experiments by evaluating neural machine translation (NMT) models and large language models (LLMs) for translations between English and these languages, at both the sentence and pseudo-document levels. These outputs are realigned to form complete documents for evaluation. Our results indicate that NLLB-200 achieved the best average performance among the standard NMT models, while GPT-4o outperformed general-purpose LLMs. Fine-tuning selected models led to substantial performance gains, but models trained on sentences struggled to…
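The pseudo-document setup mentioned in the abstract can be illustrated with a minimal sketch. It assumes a pseudo-document is a fixed-size group of consecutive sentences and that realignment is a simple in-order concatenation of the translated groups; the abstract does not specify either detail, so the function names, the `chunk_size` parameter, and the `translate` callable below are hypothetical and not taken from the paper.

```python
from typing import Callable, List


def make_pseudo_documents(sentences: List[str], chunk_size: int) -> List[List[str]]:
    """Group consecutive sentences into chunks of at most chunk_size sentences.

    Assumption: fixed-size grouping by sentence count. The paper may use a
    different criterion (for example, a token budget).
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [sentences[i:i + chunk_size] for i in range(0, len(sentences), chunk_size)]


def translate_document(
    sentences: List[str],
    translate: Callable[[str], str],
    chunk_size: int = 10,
    sep: str = " ",
) -> str:
    """Translate a document chunk by chunk and rejoin the outputs in order.

    `translate` is a placeholder for any MT system (NMT model or LLM call).
    With chunk_size=1 this reduces to sentence-level translation.
    """
    outputs = []
    for chunk in make_pseudo_documents(sentences, chunk_size):
        # Each chunk is passed to the translator as one string.
        outputs.append(translate(sep.join(chunk)))
    # Reassemble the translated chunks into a full document for evaluation.
    return sep.join(outputs)
```

For example, calling `translate_document(doc_sentences, my_model, chunk_size=1)` gives the sentence-level setting, while a larger `chunk_size` gives the pseudo-document setting; in both cases the result is a single reassembled document that can be scored against the reference document.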
