Building a Rich Dataset to Empower the Persian Question Answering Systems

Mohsen Yazdinejad; Marjan Kaedi

arXiv:2412.20212·cs.CL·December 31, 2024

TL;DR

This paper introduces NextQuAD, a comprehensive Persian question answering dataset, and demonstrates that a BERT-based model trained on it achieves high accuracy, improving performance on related datasets.

Contribution

The creation of the first large-scale Persian QA dataset, NextQuAD, and the development of a BERT-based model that outperforms existing models on Persian QA benchmarks.

Findings

01. NextQuAD contains 7,515 contexts and 23,918 questions.

02. Ensembling ParsBERT and XLM-RoBERTa improves QA accuracy.

03. A model trained on NextQuAD improves performance on other Persian datasets.
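The ensembling in finding 02 is described in the abstract as combining the two models' outputs using mean logits. The following is a minimal sketch of that idea for extractive QA, not the authors' implementation. It assumes both models' start and end logits have already been aligned to the same sequence of positions (ParsBERT and XLM-RoBERTa use different tokenizers, so some alignment step would be needed in practice). The function names `mean_logits`, `best_span`, and `ensemble_span`, and the `max_answer_len` parameter, are hypothetical.

```python
from typing import List, Tuple


def mean_logits(a: List[float], b: List[float]) -> List[float]:
    """Element-wise mean of two logit vectors over the same positions."""
    if len(a) != len(b):
        raise ValueError("logit vectors must cover the same positions")
    return [(x + y) / 2.0 for x, y in zip(a, b)]


def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 30) -> Tuple[int, int]:
    """Return the (start, end) pair with the highest summed logit,
    subject to start <= end and a maximum span length."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


def ensemble_span(start_a: List[float], end_a: List[float],
                  start_b: List[float], end_b: List[float],
                  max_answer_len: int = 30) -> Tuple[int, int]:
    """Average the two models' logits, then decode a single answer span.
    Assumption: both models score the same, already-aligned positions."""
    start = mean_logits(start_a, start_b)
    end = mean_logits(end_a, end_b)
    return best_span(start, end, max_answer_len)
```

Averaging before decoding means the two models vote on every position, so a span that one model prefers only slightly can still win if the other model supports it strongly.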

Abstract

Question answering systems provide short, precise, and specific answers to questions. Many robust question answering systems have been developed for English, while lower-resource languages such as Persian have few standard datasets. This study presents a comprehensive open-domain dataset for Persian, called NextQuAD, which has 7,515 contexts and 23,918 questions and answers. A BERT-based question answering model is then applied to this dataset using two pre-trained language models, ParsBERT and XLM-RoBERTa, and the outputs of the two models are ensembled by averaging their logits. Evaluation on the development set shows 0.95 Exact Match (EM) and 0.97 F1 score. To compare NextQuAD with other Persian datasets, the model trained on NextQuAD is also evaluated on two other datasets, PersianQA and ParSQuAD.…
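The abstract reports Exact Match and F1 without defining them. As a hedged illustration, the sketch below follows the common SQuAD-style definitions: EM checks whether the normalized prediction equals the normalized reference, and F1 is the harmonic mean of token-level precision and recall. The normalization here (removing ASCII and a few Persian punctuation marks, collapsing whitespace) is an assumption and may differ from what the paper uses. The names `normalize`, `exact_match`, and `f1_score` are hypothetical.

```python
import string
from collections import Counter

# Assumption: strip ASCII punctuation plus the Persian comma, semicolon,
# and question mark; the paper's own normalization may differ.
PUNCT = set(string.punctuation) | {"،", "؛", "؟"}


def normalize(text: str) -> str:
    """Remove punctuation and collapse runs of whitespace."""
    text = "".join(ch for ch in text if ch not in PUNCT)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def f1_score(pred: str, gold: str) -> float:
    """Token-level F1 between a predicted and a reference answer."""
    p = normalize(pred).split()
    g = normalize(gold).split()
    if not p or not g:
        # Both empty counts as a match; one empty counts as a miss.
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level scores are typically the mean of these per-question values, taking the maximum over references when a question has several gold answers.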

Taxonomy

Topics: Topic Modeling

Methods: Sparse Evolutionary Training