A Large-Scale Chinese Short-Text Conversation Dataset

Yida Wang; Pei Ke; Yinhe Zheng; Kaili Huang; Yong Jiang; Xiaoyan Zhu; and Minlie Huang

arXiv:2008.03946 · cs.CL · April 27, 2022 · 6 cites

TL;DR

This paper introduces LCCC, a large-scale cleaned Chinese conversation dataset released in a base version (6.8 million dialogues) and a large version (12.0 million dialogues), to support research on neural dialogue generation models.

Contribution

It provides a rigorously cleaned, large-scale Chinese dialogue dataset and pre-trained models, facilitating research in short-text conversation modeling.

Findings

01

LCCC-base contains 6.8M dialogues; LCCC-large contains 12.0M dialogues.

02

The cleaning pipeline combines a set of rules with a classifier trained on 110K manually annotated dialogue pairs.

03

Pre-trained dialogue models, trained on LCCC-base and LCCC-large respectively, are released alongside the dataset.

Abstract

Advances in neural dialogue generation models show promising results on modeling short-text conversations. However, training such models usually requires a large-scale, high-quality dialogue corpus, which is hard to access. In this paper, we present a large-scale cleaned Chinese conversation dataset, LCCC, which contains a base version (6.8 million dialogues) and a large version (12.0 million dialogues). The quality of our dataset is ensured by a rigorous data cleaning pipeline, which is built on a set of rules and a classifier trained on 110K manually annotated dialogue pairs. We also release pre-trained dialogue models that are trained on LCCC-base and LCCC-large respectively. The cleaned dataset and the pre-trained models will facilitate research on short-text conversation modeling. All the models and datasets are available at…
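
The two-stage cleaning idea described in the abstract (rule filters followed by a learned classifier) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the specific rules (length cap, URL filter, repeated-character filter), the names `passes_rules`, `clean`, and `dummy_score`, and the 0.5 threshold are hypothetical and are not taken from the paper. The classifier is represented by a placeholder scoring function rather than a trained model.

```python
import re
from typing import Callable, Iterable, List

Dialogue = List[str]  # one dialogue = an ordered list of utterances

# Hypothetical rules, for illustration only; not the paper's actual rule set.
URL_RE = re.compile(r"https?://\S+")
REPEAT_RE = re.compile(r"(.)\1{5,}")  # the same character 6 or more times in a row


def passes_rules(dialogue: Dialogue, max_len: int = 50) -> bool:
    """Return True if every utterance survives the assumed rule filters."""
    if len(dialogue) < 2:
        return False  # need at least a post and a response
    for utterance in dialogue:
        text = utterance.strip()
        if not text or len(text) > max_len:
            return False
        if URL_RE.search(text) or REPEAT_RE.search(text):
            return False
    return True


def clean(dialogues: Iterable[Dialogue],
          score: Callable[[Dialogue], float],
          threshold: float = 0.5) -> List[Dialogue]:
    """Keep dialogues that pass the rules and score at or above the threshold."""
    return [d for d in dialogues if passes_rules(d) and score(d) >= threshold]


def dummy_score(dialogue: Dialogue) -> float:
    """Placeholder for a trained quality classifier; always returns 1.0."""
    return 1.0


if __name__ == "__main__":
    raw = [
        ["今天天气怎么样", "挺好的，适合出门"],
        ["看这个 http://example.com", "好的"],
        ["哈哈哈哈哈哈哈哈", "好"],
        ["只有一句"],
    ]
    for dialogue in clean(raw, dummy_score):
        print(dialogue)
```

Running the cheap rule filters before the classifier keeps the more expensive model call off dialogues that are already rejected. In a real pipeline `dummy_score` would be replaced by a model trained on annotated dialogue pairs.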

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems