Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval
Dingkun Long, Qiong Gao, Kuan Zou, Guangwei Xu, Pengjun Xie, Ruijie Guo, Jian Xu, Guanjun Jiang, Luxi Xing, Ping Yang

TL;DR
This paper introduces Multi-CPR, a large multi-domain Chinese dataset for passage retrieval, highlighting the importance of domain-specific data for improving retrieval performance in Chinese language systems.
Contribution
The paper presents Multi-CPR, a novel multi-domain Chinese passage retrieval dataset covering three domains (E-commerce, Entertainment video, and Medical), each with millions of passages and human-annotated query-passage pairs.
Findings
Models trained on general domain data perform poorly on specific domains.
In-domain training significantly improves retrieval accuracy.
The dataset provides a benchmark for Chinese passage retrieval across the E-commerce, Entertainment video, and Medical domains.
Abstract
Passage retrieval is a fundamental task in information retrieval (IR) research, which has drawn much attention recently. In the English field, the availability of large-scale annotated datasets (e.g., MS MARCO) and the emergence of deep pre-trained language models (e.g., BERT) have resulted in a substantial improvement of existing passage retrieval systems. However, in the Chinese field, especially for specific domains, passage retrieval systems are still immature because high-quality annotated datasets are limited in scale. Therefore, in this paper, we present a novel multi-domain Chinese dataset for passage retrieval (Multi-CPR). The dataset is collected from three different domains: E-commerce, Entertainment video, and Medical. Each dataset contains millions of passages and a certain amount of human-annotated query-passage related pairs. We implement various representative passage…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
