AmaSQuAD: A Benchmark for Amharic Extractive Question Answering

Nebiyou Daniel Hailemariam; Blessed Guda; Tsegazeab Tefferi

arXiv:2502.02047·cs.CL·February 5, 2025·2 cites

TL;DR

This paper introduces AmaSQuAD, an Amharic extractive question-answering dataset produced by translating SQuAD 2.0, and shows that fine-tuning XLM-R on it improves question-answering performance in this low-resource language.

Contribution

The paper presents a framework for translating extractive QA datasets into low-resource languages and fine-tuning models on the translated data, applied here to create and evaluate AmaSQuAD for Amharic.

Findings

01 F1 score improved from 36.55% to 44.41% on the AmaSQuAD development set

02 F1 score increased from 67.80% to 68.80% on the AmQA dataset

03 Fine-tuning on the synthetic AmaSQuAD dataset improves model performance over the baseline
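
For readers unfamiliar with the metric, the F1 figures above are presumably token-overlap F1 in the SQuAD style. The sketch below is a minimal illustration under that assumption. The function name token_f1 is hypothetical, whitespace tokenization is a simplification, and any Amharic-specific normalization the authors may apply is not reproduced here.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer string."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    # SQuAD 2.0 convention: an empty (unanswerable) reference is matched
    # only by an empty prediction.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

With this definition, a prediction that contains the full reference plus one extra token is penalized on precision but not on recall, which is why F1 is usually reported alongside exact match.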

Abstract

This research presents a novel framework for translating extractive question-answering datasets into low-resource languages, as demonstrated by the creation of the AmaSQuAD dataset, a translation of SQuAD 2.0 into Amharic. The methodology addresses challenges related to misalignment between translated questions and answers, as well as the presence of multiple answer instances in the translated context. For this purpose, we use cosine similarity over embeddings from a fine-tuned BERT-based model for Amharic, together with Longest Common Subsequence (LCS). Additionally, we fine-tune the XLM-R model on the AmaSQuAD synthetic dataset for Amharic Question-Answering. The results show an improvement in baseline performance, with the fine-tuned model achieving an increase in the F1 score from 36.55% to 44.41% and 50.01% to 57.5% on the AmaSQuAD development dataset. Moreover, the model demonstrates…
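
The two alignment signals named in the abstract can be sketched as follows. This is an illustration only, not the authors' pipeline: the abstract does not say how LCS and cosine similarity are combined, so the helper names (lcs_length, best_span, cosine), the max_extra window parameter, the LCS-ratio score, and the toy vectors standing in for BERT embeddings are all assumptions. English tokens stand in for Amharic text.

```python
import math


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def best_span(context_tokens, answer_tokens, max_extra=2):
    """Return (start, end, score) for the context window closest to the answer.

    The score is an LCS ratio in [0, 1]; max_extra (hypothetical) lets the
    window be slightly longer than the answer to absorb translation drift.
    """
    best = (0, 0, 0.0)
    n = len(answer_tokens)
    for start in range(len(context_tokens)):
        for length in range(1, n + max_extra + 1):
            end = start + length
            if end > len(context_tokens):
                break
            window = context_tokens[start:end]
            score = 2 * lcs_length(window, answer_tokens) / (len(window) + n)
            if score > best[2]:
                best = (start, end, score)
    return best


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(u, v))
    norm = math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v))
    return dot / norm if norm else 0.0


# Locate a translated answer inside a translated context.
context = "the cat sat on the mat".split()
answer = "sat on".split()
print(best_span(context, answer))  # (2, 4, 1.0)

# When several spans match equally well, embeddings can break the tie.
# These 2-d vectors are toy stand-ins for real sentence embeddings.
question_vec = [0.9, 0.1]
candidates = {"span_a": [1.0, 0.0], "span_b": [0.0, 1.0]}
chosen = max(candidates, key=lambda k: cosine(question_vec, candidates[k]))
```

LCS tolerates small insertions or reorderings introduced by machine translation, while the embedding comparison is one plausible way to choose among several surface-identical answer occurrences.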

Code & Models

Datasets

nebhailema/AmaSquad
dataset · 7 dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems

Methods: XLM-R