AmharicIR+Instr: A Two-Dataset Resource for Neural Retrieval and Instruction Tuning
Tilahun Yeshambel, Moncef Garouani, Josiane Mothe

TL;DR
This paper introduces two Amharic datasets supporting neural retrieval and instruction-following tasks, addressing data scarcity in low-resource languages and enabling research in retrieval ranking and generative modeling.
Contribution
It provides the first large-scale, high-quality Amharic datasets for neural retrieval and instruction tuning, with detailed methodology for dataset creation applicable to other low-resource languages.
Findings
The datasets support training and benchmarking of neural retrieval and instruction-following models in Amharic.
Manual validation ensures high-quality data for training and benchmarking.
Methodology can be adapted for other low-resource languages.
Abstract
Neural retrieval and GPT-style generative models rely on large, high-quality supervised data, which is still scarce for low-resource languages such as Amharic. We release an Amharic data resource consisting of two datasets that supports research on (i) neural retrieval-ranking and (ii) instruction-following text generation. The retrieval-ranking dataset contains 1,091 manually verified query-positive-negative document triplets drawn from diverse Amharic sources and constructed to support contrastive training and benchmarking of neural retrievers (e.g., DPR, ColBERT-style late interaction and SPLADE-style sparse neural retrieval). Triplets are created through a combination of expert-curated queries, web-derived queries, and LLM-assisted generation, with positive/negative documents selected from the web or synthesized by LLMs and then validated by native speakers. The instruction…
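The query-positive-negative triplet format mentioned in the abstract can be illustrated with a minimal sketch. The `Triplet` class, its field names, and the `contrastive_loss` function below are hypothetical and are not taken from the released dataset or the authors' code. The loss shown is a generic softmax cross-entropy over one positive and one negative score, of the kind commonly used to train DPR-style dense retrievers; the vectors stand in for encoder outputs, which are assumed rather than computed here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One retrieval training example (field names are illustrative assumptions)."""
    query: str
    positive: str  # document judged relevant to the query
    negative: str  # document judged not relevant to the query


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(q_vec, pos_vec, neg_vec):
    """Negative log-probability of the positive under a softmax over both scores.

    The vectors are assumed to come from some query/document encoder;
    no specific model is implied.
    """
    s_pos = dot(q_vec, pos_vec)
    s_neg = dot(q_vec, neg_vec)
    # Subtract the max score before exponentiating for numerical stability.
    m = max(s_pos, s_neg)
    log_z = m + math.log(math.exp(s_pos - m) + math.exp(s_neg - m))
    return log_z - s_pos


# Toy usage with hand-written 2-d vectors in place of real embeddings.
example = Triplet(query="q", positive="relevant doc", negative="irrelevant doc")
loss_good = contrastive_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
loss_bad = contrastive_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

In this toy setup the loss is lower when the query vector is closer to the positive than to the negative, which is the signal a contrastive objective uses to push relevant documents above non-relevant ones.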
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
