Pchatbot: A Large-Scale Dataset for Personalized Chatbot
Hongjin Qian, Xiaohe Li, Hanxun Zhong, Yu Guo, Yueyuan Ma, Yutao Zhu, Zhanliang Liu, Zhicheng Dou, Ji-Rong Wen

TL;DR
Pchatbot is a large-scale Chinese dialogue dataset from Weibo and Judicial forums, designed to facilitate personalized chatbot development by including anonymized user IDs and timestamps, enabling models to learn user personalities.
Contribution
The paper introduces Pchatbot, a significantly larger and more detailed Chinese dialogue dataset with anonymized user data, supporting personalized dialogue modeling.
Findings
Benchmark results for state-of-the-art models provided
Dataset's scale surpasses existing Chinese dialogue datasets
Inclusion of user IDs and timestamps enables personalized modeling
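As a rough illustration of the last finding, the sketch below groups dialogue records by anonymized user ID and orders them by timestamp, so that a model can condition on what a user wrote earlier. The record layout `(user_id, timestamp, post, response)` and the function names are assumptions made for this example, not the dataset's actual schema or the authors' code.

```python
from collections import defaultdict

def build_user_histories(records):
    """Group records by user ID and sort each user's records by timestamp.

    Each record is assumed (hypothetically) to be a tuple:
    (user_id, timestamp, post, response).
    """
    histories = defaultdict(list)
    for user_id, timestamp, post, response in records:
        histories[user_id].append((timestamp, post, response))
    for user_id in histories:
        # Chronological order, oldest record first.
        histories[user_id].sort(key=lambda item: item[0])
    return dict(histories)

def history_before(histories, user_id, timestamp):
    """Return the user's records strictly earlier than the given timestamp."""
    return [item for item in histories.get(user_id, []) if item[0] < timestamp]
```

Restricting the history to records earlier than the current timestamp keeps later responses from leaking into the context used to model a user at a given point in time.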
Abstract
Natural language dialogue systems have attracted great attention recently. As many dialogue models are data-driven, high-quality datasets are essential to these systems. In this paper, we introduce Pchatbot, a large-scale dialogue dataset that contains two subsets collected from Weibo and Judicial forums respectively. To adapt the raw dataset to dialogue systems, we elaborately normalize the raw dataset via processes such as anonymization, deduplication, segmentation, and filtering. The scale of Pchatbot is significantly larger than existing Chinese datasets, which might benefit the data-driven models. Besides, current dialogue datasets for personalized chatbot usually contain several persona sentences or attributes. Different from existing datasets, Pchatbot provides anonymized user IDs and timestamps for both posts and responses. This enables the development of personalized dialogue models…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
