ArabicaQA: A Comprehensive Dataset for Arabic Question Answering
Abdelrahman Abdallah, Mahmoud Kasem, Mahmoud Abdalla, Mohamed Mahmoud, Mohamed Elkasaby, Yasser Elbendary, Adam Jatowt

TL;DR
This paper introduces ArabicaQA, a large-scale Arabic question answering dataset, along with the AraDPR dense passage retrieval model and a benchmark of large language models on Arabic question answering.
Contribution
The paper presents the first large-scale Arabic dataset for machine reading comprehension and open-domain QA, a dense retrieval model trained on Arabic Wikipedia, and LLM benchmarks, addressing key gaps in Arabic NLP research.
Findings
ArabicaQA contains 89,095 answerable and 3,701 unanswerable questions.
AraDPR effectively retrieves relevant Arabic passages.
LLMs show promising but varied performance in Arabic QA.
Abstract
In this paper, we address the significant gap in Arabic natural language processing (NLP) resources by introducing ArabicaQA, the first large-scale dataset for machine reading comprehension and open-domain question answering in Arabic. This comprehensive dataset, consisting of 89,095 answerable and 3,701 unanswerable questions created by crowdworkers to look similar to answerable ones, along with additional labels of open-domain questions, marks a crucial advancement in Arabic NLP resources. We also present AraDPR, the first dense passage retrieval model trained on the Arabic Wikipedia corpus, specifically designed to tackle the unique challenges of Arabic text retrieval. Furthermore, our study includes extensive benchmarking of large language models (LLMs) for Arabic question answering, critically evaluating their performance in the Arabic language context. In conclusion, ArabicaQA,…
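The abstract describes AraDPR as a dense passage retrieval model but does not give its architecture. As a rough illustration of how dense retrieval generally works (questions and passages are embedded as vectors, and passages are ranked by inner product with the question vector), here is a minimal sketch. The toy vectors and the helper names `dot` and `rank_passages` are assumptions made for illustration; they are not taken from the paper or from any AraDPR release.

```python
from typing import List, Tuple


def dot(u: List[float], v: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_passages(
    question_vec: List[float],
    passage_vecs: List[List[float]],
    top_k: int = 2,
) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs for the top_k highest-scoring passages."""
    scored = [(i, dot(question_vec, p)) for i, p in enumerate(passage_vecs)]
    # Higher inner product means the passage is considered more relevant.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings; a real system would obtain these from trained encoders.
question_vec = [1.0, 0.0, 1.0]
passage_vecs = [
    [0.5, 0.5, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]

top = rank_passages(question_vec, passage_vecs, top_k=2)
print(top)
```

In practice the passage vectors would be precomputed for the whole corpus and searched with an approximate nearest-neighbor index rather than a full scan; the exhaustive loop here is only meant to show the scoring step.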
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Text Analysis Techniques
