Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval

Kidist Amde Mekonnen; Yosef Worku Alemneh; Maarten de Rijke

arXiv:2505.19356 · cs.IR · June 11, 2025



TL;DR

This paper introduces Amharic-specific dense retrieval models built on pre-trained Amharic BERT and RoBERTa backbones. The best model improves MRR@10 by 17.6% (relative) over the strongest multilingual baseline, Arctic Embed 2.0, and the paper provides benchmarks and resources for low-resource Amharic passage retrieval.

Contribution

The paper presents Amharic-specific dense retrieval models, benchmarks, and resources, addressing the gap in retrieval effectiveness for low-resource languages.

Findings

01

RoBERTa-Base-Amharic-Embed (110M parameters) outperforms the strongest multilingual baseline, Arctic Embed 2.0 (568M parameters), by 17.6% relative MRR@10

02

Compact variants such as RoBERTa-Medium-Amharic-Embed (42M parameters) remain competitive while being over 13x smaller

03

The ColBERT-based late interaction model achieves the highest MRR@10 score, 0.843
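For readers unfamiliar with late interaction, the sketch below shows the generic ColBERT-style MaxSim scoring rule: each query token embedding is matched with its most similar passage token embedding, and the per-token maxima are summed. It is an illustration only, not the authors' implementation. The function names `normalize` and `maxsim_score` and the two-dimensional toy vectors are assumptions made for this example, and real models use learned token embeddings of much higher dimension.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def normalize(vec: Vector) -> List[float]:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_tokens: Sequence[Vector], passage_tokens: Sequence[Vector]) -> float:
    """Generic late interaction score: sum over query tokens of the best
    cosine similarity against any passage token (hypothetical helper)."""
    q = [normalize(v) for v in query_tokens]
    p = [normalize(v) for v in passage_tokens]
    return sum(
        max(sum(a * b for a, b in zip(qv, pv)) for pv in p)
        for qv in q
    )


# Toy token embeddings (made up for illustration, not model outputs).
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 2.0]]  # has a close match for each query token
passage_b = [[1.0, 1.0]]              # a single token that partially matches both

score_a = maxsim_score(query, passage_a)
score_b = maxsim_score(query, passage_b)
print(score_a, score_b)
```

Because every query token is scored separately, a passage that matches each token well (`passage_a`) ranks above one that only partially matches all of them (`passage_b`).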

Abstract

Neural retrieval methods using transformer-based pre-trained language models have advanced multilingual and cross-lingual retrieval. However, their effectiveness for low-resource, morphologically rich languages such as Amharic remains underexplored due to data scarcity and suboptimal tokenization. We address this gap by introducing Amharic-specific dense retrieval models based on pre-trained Amharic BERT and RoBERTa backbones. Our proposed RoBERTa-Base-Amharic-Embed model (110M parameters) achieves a 17.6% relative improvement in MRR@10 and a 9.86% gain in Recall@10 over the strongest multilingual baseline, Arctic Embed 2.0 (568M parameters). More compact variants, such as RoBERTa-Medium-Amharic-Embed (42M), remain competitive while being over 13x smaller. Additionally, we train a ColBERT-based late interaction retrieval model that achieves the highest MRR@10 score (0.843) among all…
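The abstract reports results in MRR@10, Recall@10, and relative improvement. As a minimal sketch of how these quantities are commonly computed (using standard definitions, not the paper's evaluation code), the example below scores ranked passage lists against relevance judgments. The helper names, query and passage IDs, and toy rankings are all assumptions for illustration and do not reproduce any number from the paper.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k."""
    total = 0.0
    for qid, ranked in rankings.items():
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in relevant[qid]:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


def recall_at_k(rankings: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean fraction of each query's relevant passages found in the top k."""
    total = 0.0
    for qid, ranked in rankings.items():
        rel = relevant[qid]
        total += len(rel.intersection(ranked[:k])) / len(rel)
    return total / len(rankings)


def relative_improvement(new: float, baseline: float) -> float:
    """Percentage gain of `new` over `baseline`, relative to the baseline."""
    return (new - baseline) / baseline * 100.0


# Hypothetical rankings and judgments for two queries.
rankings = {"q1": ["p3", "p1", "p7"], "q2": ["p2", "p9", "p4"]}
relevant = {"q1": {"p1"}, "q2": {"p4", "p8"}}

mrr = mrr_at_k(rankings, relevant)
recall = recall_at_k(rankings, relevant)
print(f"MRR@10={mrr:.3f} Recall@10={recall:.3f}")
```

A relative improvement such as the 17.6% reported above is a ratio of metric values, so it reads as "17.6% higher than the baseline's MRR@10", not as an absolute gain of 0.176.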


Code & Models

Repositories

kidist-amde/amharic-ir-benchmarks (PyTorch, official)


Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies

Methods: Attention Is All You Need · Linear Layer · Attention Dropout · Softmax · WordPiece · Weight Decay · Multi-Head Attention · Layer Normalization · Dropout