TL;DR
This paper introduces AfriMTEB, a regional expansion of the MMTEB text-embedding benchmark for African languages, and AfriE5, an adaptation of the instruction-tuned mE5 embedding model to African languages.
Contribution
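The paper expands MMTEB into a regional benchmark covering 59 languages, adding six new datasets that span 14 to 56 African languages and introduce tasks not previously covered (hate speech detection, intent detection, and emotion classification). It also adapts the instruction-tuned mE5 model to African languages.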
Findings
AfriMTEB covers 59 languages, 14 tasks, and 38 datasets, six of which are newly added.
AfriE5 outperforms existing models like Gemini-Embeddings and mE5.
Adapting mE5 to African languages improves embedding performance on those languages.
Abstract
Text embeddings are an essential building block of several NLP tasks, such as retrieval-augmented generation, which is crucial for preventing hallucinations in LLMs. Despite the recent release of massively multilingual MTEB (MMTEB), African languages remain underrepresented, with existing tasks often repurposed from translation benchmarks such as FLORES clustering or SIB-200. In this paper, we introduce AfriMTEB, a regional expansion of MMTEB covering 59 languages, 14 tasks, and 38 datasets, including six newly added datasets. Unlike many MMTEB datasets that include fewer than five languages, the new additions span 14 to 56 African languages and introduce entirely new tasks, such as hate speech detection, intent detection, and emotion classification, which were not previously covered. Complementing this, we present AfriE5, an adaptation of the instruction-tuned mE5 model to African…
