ChiMDQA: Towards Comprehensive Chinese Document QA with Fine-grained Evaluation
Jing Gao, Shutiao Luo, Yumeng Liu, Yuanming Li, and Hongji Zeng

TL;DR
This paper introduces ChiMDQA, a Chinese document QA dataset of 6,068 high-quality question-answer pairs spanning six domains, intended to support NLP tasks such as document comprehension, knowledge extraction, and intelligent QA systems.
Contribution
The paper presents ChiMDQA, a large-scale, multi-domain Chinese document QA dataset with fine-grained annotations and a systematic construction methodology, filling a gap in Chinese NLP resources.
Findings
Dataset contains 6,068 QA pairs across six domains.
QA pairs are rigorously curated and classified into ten fine-grained categories.
Supports multiple NLP tasks such as document comprehension, knowledge extraction, and intelligent QA systems.
Abstract
With the rapid advancement of natural language processing (NLP) technologies, the demand for high-quality Chinese document question-answering datasets is steadily growing. To meet this demand, we present the Chinese Multi-Document Question Answering Dataset (ChiMDQA), designed for downstream business scenarios across six prevalent domains: academic, education, finance, law, medical treatment, and news. ChiMDQA comprises long-form documents from these six fields and 6,068 rigorously curated, high-quality question-answer (QA) pairs, further classified into ten fine-grained categories. Through meticulous document screening and a systematic question-design methodology, the dataset guarantees both diversity and high quality, rendering it applicable to various NLP tasks such as document comprehension, knowledge extraction, and intelligent QA systems.…
Taxonomy
Topics: Topic Modeling · Advanced Text Analysis Techniques · Text and Document Classification Technologies
