Building a Rich Dataset to Empower the Persian Question Answering Systems
Mohsen Yazdinejad, Marjan Kaedi

TL;DR
This paper introduces NextQuAD, a comprehensive Persian question answering dataset, and demonstrates that a BERT-based model trained on it achieves high accuracy, improving performance on related datasets.
Contribution
The creation of the first large-scale Persian QA dataset, NextQuAD, and the development of a BERT-based model that outperforms existing models on Persian QA benchmarks.
Findings
NextQuAD contains 7,515 contexts and 23,918 questions.
Ensembling ParsBERT and XLM-RoBERTa improves QA accuracy.
A model trained on NextQuAD improves performance on other Persian datasets.
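The mean-logit ensembling mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each model's start and end logits have already been aligned to the same positions (ParsBERT and XLM-RoBERTa use different tokenizers, so some alignment step, for example to words or characters, would be needed first). The function names and the `max_answer_len` parameter are hypothetical.

```python
def mean_logits(logits_a, logits_b):
    """Element-wise average of two logit sequences over the same positions."""
    if len(logits_a) != len(logits_b):
        raise ValueError("logit sequences must be aligned to the same positions")
    return [(a + b) / 2.0 for a, b in zip(logits_a, logits_b)]


def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximizing start_logits[s] + end_logits[e], s <= e."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length cap.
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


def ensemble_span(model_a, model_b, max_answer_len=30):
    """Each model is a (start_logits, end_logits) pair over shared positions."""
    start = mean_logits(model_a[0], model_b[0])
    end = mean_logits(model_a[1], model_b[1])
    return best_span(start, end, max_answer_len)


# Made-up logits for a four-position context (illustrative values only).
model_a = ([0.1, 2.0, 0.3, 0.0], [0.0, 0.2, 1.5, 0.1])
model_b = ([0.2, 1.0, 1.2, 0.0], [0.1, 0.1, 0.4, 2.0])
span = ensemble_span(model_a, model_b)
```

With these made-up values, the two models disagree when decoded on their own, and the averaged logits select a span that neither model picks alone. Averaging logits rather than final answers lets a confident model outweigh an uncertain one at each position.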
Abstract
Question answering systems provide short, precise, and specific answers to questions. So far, many robust question answering systems have been developed for English, while lower-resource languages such as Persian have few standard datasets. In this study, a comprehensive open-domain dataset is presented for Persian. This dataset is called NextQuAD and has 7,515 contexts, including 23,918 questions and answers. A BERT-based question answering model is then applied to this dataset using two pre-trained language models, ParsBERT and XLM-RoBERTa. The outputs of these two models are ensembled by averaging their logits. Evaluation on the development set shows 0.95 Exact Match (EM) and 0.97 F1-score. Also, to compare NextQuAD with other Persian datasets, the model trained on NextQuAD is evaluated on two other datasets named PersianQA and ParSQuAD.…
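For readers unfamiliar with the two reported metrics, here is a minimal sketch of SQuAD-style Exact Match and token-level F1. The normalization used here (whitespace collapsing only) is an assumption; the paper's exact preprocessing for Persian text is not stated in the abstract, and the function names are illustrative.

```python
from collections import Counter


def normalize(text):
    """Collapse whitespace; a real evaluation may also normalize Persian characters."""
    return " ".join(text.strip().split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level scores such as those in the abstract are typically the average of these per-question values, often taking the maximum over several reference answers when more than one is available.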
Taxonomy
Topics: Topic Modeling
Methods: Sparse Evolutionary Training
