Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval
Kidist Amde Mekonnen, Yosef Worku Alemneh, Maarten de Rijke

TL;DR
This paper introduces Amharic-specific dense retrieval models based on transformer architectures, achieving significant improvements over multilingual baselines and providing benchmarks and resources for low-resource Amharic passage retrieval.
Contribution
It presents new Amharic-specific dense retrieval models, benchmarks, and resources, addressing the gap in low-resource language retrieval effectiveness.
Findings
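RoBERTa-Base-Amharic-Embed (110M parameters) outperforms the strongest multilingual baseline, Arctic Embed 2.0 (568M parameters), by 17.6% relative MRR@10
Compact variants such as RoBERTa-Medium-Amharic-Embed (42M parameters) remain competitive while being over 13x smaller than that baseline
The ColBERT-based late interaction model achieves the highest MRR@10 score, 0.843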
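For readers unfamiliar with the metrics behind these findings, the following is a minimal sketch of how MRR@10, Recall@10, and a relative improvement are commonly computed. It is not the authors' evaluation code: the function names and the toy query and passage IDs are illustrative assumptions, and only the parameter counts (568M and 42M) are taken from the abstract.

```python
from typing import Dict, List, Set

def mrr_at_k(rankings: Dict[str, List[str]],
             relevant: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k."""
    total = 0.0
    for qid, ranked in rankings.items():
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in relevant[qid]:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)

def recall_at_k(rankings: Dict[str, List[str]],
                relevant: Dict[str, Set[str]], k: int = 10) -> float:
    """Average fraction of each query's relevant passages found in the top k."""
    total = 0.0
    for qid, ranked in rankings.items():
        rel = relevant[qid]
        total += len(rel & set(ranked[:k])) / len(rel)
    return total / len(rankings)

def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

# Toy data (hypothetical IDs, not from the paper's benchmark).
rankings = {"q1": ["p3", "p1", "p2"], "q2": ["p5", "p4", "p6"]}
relevant = {"q1": {"p1"}, "q2": {"p9"}}

mrr = mrr_at_k(rankings, relevant)        # q1 hits at rank 2, q2 misses
recall = recall_at_k(rankings, relevant)  # q1 fully recalled, q2 not

# Size ratio from the parameter counts reported in the abstract.
size_ratio = 568 / 42

print(mrr, recall, round(size_ratio, 2))
```

The size ratio of roughly 13.5 is consistent with the "over 13x smaller" claim when the 42M-parameter model is compared against the 568M-parameter baseline.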
Abstract
Neural retrieval methods using transformer-based pre-trained language models have advanced multilingual and cross-lingual retrieval. However, their effectiveness for low-resource, morphologically rich languages such as Amharic remains underexplored due to data scarcity and suboptimal tokenization. We address this gap by introducing Amharic-specific dense retrieval models based on pre-trained Amharic BERT and RoBERTa backbones. Our proposed RoBERTa-Base-Amharic-Embed model (110M parameters) achieves a 17.6% relative improvement in MRR@10 and a 9.86% gain in Recall@10 over the strongest multilingual baseline, Arctic Embed 2.0 (568M parameters). More compact variants, such as RoBERTa-Medium-Amharic-Embed (42M), remain competitive while being over 13x smaller. Additionally, we train a ColBERT-based late interaction retrieval model that achieves the highest MRR@10 score (0.843) among all…
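The late interaction scoring mentioned in the abstract can be illustrated with the standard ColBERT MaxSim operator: each query token embedding is matched to its most similar passage token embedding, and those maxima are summed. The sketch below is a pure-Python illustration under that assumption, with made-up two-dimensional token vectors; it is not the authors' implementation and does not reflect any specifics of their training setup.

```python
import math
from typing import List

Vector = List[float]

def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Sum, over query tokens, of the best cosine match among passage tokens."""
    q = [l2_normalize(v) for v in query_tokens]
    d = [l2_normalize(v) for v in doc_tokens]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )

# Hypothetical token embeddings (2-dimensional for readability).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0]]  # has a good match for both query tokens
doc_b = [[1.0, 0.0]]              # matches only the first query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because every query token is scored against every passage token, this style of model keeps one vector per token rather than a single vector per passage, which is the main contrast with the single-vector embedding models listed in this summary.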
Code & Models
- 🤗 rasyosef/bert-amharic-text-embedding-medium (model, 513 downloads)
- 🤗 rasyosef/roberta-amharic-text-embedding-base (model, 31 downloads)
- 🤗 rasyosef/roberta-amharic-text-embedding-medium (model, 12 downloads)
- 🤗 rasyosef/snowflake-arctic-embed-l-v2.0-finetuned-amharic (model, 3 downloads)
- 🤗 rasyosef/colbert-bert-amharic-medium (model, 2 downloads)
- 🤗 rasyosef/colbert-roberta-amharic-medium (model)
- 🤗 rasyosef/colbert-roberta-amharic-base (model)
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies
Methods: Attention Is All You Need · Linear Layer · Attention Dropout · Softmax · WordPiece · Weight Decay · Multi-Head Attention · Layer Normalization · Dropout
