ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information
Wanyue Zhang, Ziyong Li, Wen Yang, Chunlin Leng, Yinan Bai, Qianlong Du, Chengqing Zong, Jiajun Zhang

TL;DR
ChineseWebText 2.0 is a large-scale, high-quality Chinese dataset with multi-dimensional fine-grained annotations, designed to improve LLM training by providing detailed quality, domain, and toxicity information.
Contribution
This paper introduces MDFG-tool, a novel data construction pipeline that creates a comprehensive Chinese text dataset with detailed annotations, surpassing previous datasets in scale and granularity.
Findings
Releases 3.8TB of ChineseWebText 2.0 data
Provides fine-grained labels including quality, domain, and toxicity
Facilitates targeted data selection for LLM training
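As a minimal sketch of what targeted selection over such labels could look like, the snippet below filters records by domain, quality, and toxicity. The field names (`quality_score`, `domain`, `toxicity_score`), the thresholds, and the sample records are assumptions made for illustration only; they are not taken from the released dataset's schema.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical record layout: these keys are assumptions for illustration,
# not the actual schema of ChineseWebText 2.0.
Record = Dict[str, object]


def select_records(
    records: Iterable[Record],
    domain: str,
    min_quality: float = 0.9,
    max_toxicity: float = 0.1,
) -> Iterator[Record]:
    """Yield records in `domain` that pass the quality and toxicity thresholds."""
    for rec in records:
        if (
            rec["domain"] == domain
            and rec["quality_score"] >= min_quality
            and rec["toxicity_score"] <= max_toxicity
        ):
            yield rec


# Made-up sample records, used only to exercise the filter.
sample = [
    {"text": "A", "quality_score": 0.95, "domain": "medicine", "toxicity_score": 0.01},
    {"text": "B", "quality_score": 0.40, "domain": "medicine", "toxicity_score": 0.02},
    {"text": "C", "quality_score": 0.97, "domain": "law", "toxicity_score": 0.01},
    {"text": "D", "quality_score": 0.99, "domain": "medicine", "toxicity_score": 0.80},
]

selected = list(select_records(sample, domain="medicine"))
selected_texts = [rec["text"] for rec in selected]
```

Here only record "A" passes all three conditions; the thresholds would need tuning against the real label distributions.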
Abstract
During the development of large language models (LLMs), pre-training data play a critical role in shaping LLMs' capabilities. In recent years, several large-scale and high-quality pre-training datasets have been released to accelerate the research of LLMs, including ChineseWebText1.0, C4, Pile, WanJuan, MAPCC and others. However, as LLMs continue to evolve, focus has increasingly shifted to domain-specific capabilities and safety concerns, making those previous coarse-grained texts insufficient for meeting training requirements. Furthermore, fine-grained information, such as quality, domain and toxicity, is becoming increasingly important in building powerful and reliable LLMs for various scenarios. To address these challenges, in this paper we propose a new tool-chain called MDFG-tool for constructing large-scale and high-quality Chinese datasets with multi-dimensional and fine-grained…
Taxonomy
Topics: Web Data Mining and Analysis · Natural Language Processing Techniques
