PerCQA: Persian Community Question Answering Dataset
Naghme Jamali, Yadollah Yaghoobzadeh, Hesham Faili

TL;DR
PerCQA is the first Persian Community Question Answering dataset, comprising nearly 1,000 questions and over 21,000 answers, designed to facilitate research in Persian CQA tasks using pre-trained language models.
Contribution
This paper introduces PerCQA, the first annotated Persian CQA dataset, and establishes benchmarks for answer selection using pre-trained language models.
Findings
PerCQA contains 989 questions and 21,915 answers.
Pre-trained language models achieve strong performance on Persian answer selection.
The dataset is publicly available to support Persian NLP research.
Abstract
Community Question Answering (CQA) forums provide answers to many real-life questions. Thanks to their large size, these forums are very popular among machine learning researchers. Automatic answer selection, answer ranking, question retrieval, expert finding, and fact-checking are examples of learning tasks performed using CQA data. In this paper, we present PerCQA, the first Persian dataset for CQA. This dataset contains questions and answers crawled from the most well-known Persian forum. After data acquisition, we develop rigorous annotation guidelines through an iterative process and then annotate the question-answer pairs in SemEvalCQA format. PerCQA contains 989 questions and 21,915 annotated answers. We make PerCQA publicly available to encourage more research in Persian CQA. We also build strong benchmarks for the task of answer selection in PerCQA by using mono- and…
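To make the answer selection task concrete, the following is a minimal sketch of how a thread of one question and its candidate answers might be ranked and scored. It is not the paper's method or data format: the `Thread` and `Answer` classes, the `overlap_score` stand-in for a model, the label names ("Good", "PotentiallyUseful", "Bad", borrowed from the SemEval CQA tasks), and the choice of mean average precision as the metric are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Answer:
    text: str
    label: str  # assumed SemEval-style label: "Good", "PotentiallyUseful", or "Bad"


@dataclass
class Thread:
    question: str
    answers: List[Answer]


def overlap_score(question: str, answer: str) -> float:
    """Toy relevance score (placeholder for a real model):
    fraction of question tokens that also appear in the answer."""
    q_tokens = set(question.lower().split())
    a_tokens = set(answer.lower().split())
    return len(q_tokens & a_tokens) / len(q_tokens) if q_tokens else 0.0


def average_precision(ranked_labels: List[str]) -> float:
    """Average precision for one thread, treating only "Good" as relevant."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == "Good":
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(
    threads: List[Thread], scorer: Callable[[str, str], float]
) -> float:
    """Rank each thread's answers by the scorer, then average per-thread AP."""
    aps = []
    for thread in threads:
        ranked = sorted(
            thread.answers,
            key=lambda a: scorer(thread.question, a.text),
            reverse=True,
        )
        aps.append(average_precision([a.label for a in ranked]))
    return sum(aps) / len(aps) if aps else 0.0


# Hypothetical example thread (invented text, not taken from PerCQA).
example = Thread(
    question="how to renew passport",
    answers=[
        Answer("go to the office", "Bad"),
        Answer("renew passport online", "Good"),
        Answer("how to renew passport at the police office", "Good"),
    ],
)
map_score = mean_average_precision([example], overlap_score)
```

In practice the scorer would be replaced by a trained model that takes the question and answer text as input; the ranking and evaluation code would stay the same.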
Taxonomy
Topics: Expert finding and Q&A systems · Topic Modeling
