AmaSQuAD: A Benchmark for Amharic Extractive Question Answering
Nebiyou Daniel Hailemariam, Blessed Guda, Tsegazeab Tefferi

TL;DR
This paper introduces AmaSQuAD, an Amharic extractive question-answering dataset created by translating SQuAD 2.0, and shows that fine-tuning on it improves model performance for this low-resource language.
Contribution
The paper presents a framework for translating extractive QA datasets into low-resource languages and fine-tuning models on the result, and applies it to create and evaluate AmaSQuAD for Amharic.
Findings
F1 score improved from 36.55% to 44.41% on the AmaSQuAD development set
F1 score increased from 67.80% to 68.80% on the AmQA dataset
Fine-tuning on the synthetic AmaSQuAD dataset improves F1 on both the AmaSQuAD development set and the AmQA dataset
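For readers unfamiliar with the metric, the F1 figures above are most likely token-overlap F1 in the style of the SQuAD evaluation. The page does not say which scoring script was used, so the sketch below is an assumption: it uses plain whitespace tokenization with no text normalization, and the function name `token_f1` is illustrative.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Assumption: whitespace tokenization and no normalization; a real
    evaluation script may lowercase or strip punctuation first.
    """
    pred_tokens = prediction.split()
    ref_tokens = reference.split()

    # SQuAD 2.0 convention for unanswerable questions: if either side is
    # empty, the score is 1 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A dataset-level score is then typically the mean of per-question scores, taking the maximum over reference answers when a question has several.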
Abstract
This research presents a novel framework for translating extractive question-answering datasets into low-resource languages, as demonstrated by the creation of the AmaSQuAD dataset, a translation of SQuAD 2.0 into Amharic. The methodology addresses challenges related to misalignment between translated questions and answers, as well as the presence of multiple answer instances in the translated context. To address these, we use cosine similarity over embeddings from a BERT-based model fine-tuned for Amharic, together with Longest Common Subsequence (LCS) matching. Additionally, we fine-tune the XLM-R model on the AmaSQuAD synthetic dataset for Amharic question answering. The results show an improvement in baseline performance, with the fine-tuned model achieving an increase in the F1 score from 36.55% to 44.41% and 50.01% to 57.5% on the AmaSQuAD development dataset. Moreover, the model demonstrates…
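The alignment idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how LCS and cosine similarity are combined, so the code below only shows one plausible use of each. The names `lcs_length`, `best_span`, and `cosine_similarity`, the `max_extra` window parameter, and the normalized score are all assumptions made for illustration, and the embedding vectors passed to `cosine_similarity` would come from a model that is not shown here.

```python
import math


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings (characters)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def best_span(context: str, answer: str, max_extra: int = 1):
    """Find the context span most similar to a translated answer.

    Hypothetical scoring: 2 * LCS / (len(span) + len(answer)), which is 1.0
    for an exact match. Windows of roughly the answer's token length are
    tried; ties go to the earliest span.
    """
    if not answer.strip():
        return None
    tokens = context.split()
    n = len(answer.split())
    best = None
    for size in range(max(1, n - max_extra), n + max_extra + 1):
        for start in range(len(tokens) - size + 1):
            span = " ".join(tokens[start:start + size])
            score = 2 * lcs_length(span, answer) / (len(span) + len(answer))
            if best is None or score > best[1]:
                best = (span, score)
    return best


def cosine_similarity(u, v) -> float:
    """Cosine similarity of two equal-length vectors, e.g. sentence embeddings."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0
```

When the same answer text occurs more than once in the translated context, a similarity score over embeddings of each candidate's surrounding text could be used to pick among them; that step is left out because it depends on the embedding model.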
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
Methods: XLM-R
