ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information

Wanyue Zhang; Ziyong Li; Wen Yang; Chunlin Leng; Yinan Bai; Qianlong Du; Chengqing Zong; Jiajun Zhang

arXiv:2411.19668·cs.CL·December 2, 2024

TL;DR

ChineseWebText 2.0 is a large-scale, high-quality Chinese dataset with multi-dimensional fine-grained annotations, designed to improve LLM training by providing detailed quality, domain, and toxicity information.

Contribution

This paper introduces MDFG-tool, a novel data construction pipeline that creates a comprehensive Chinese text dataset with detailed annotations, surpassing previous datasets in scale and granularity.

Findings

01. Releases 3.8 TB of ChineseWebText 2.0 data
02. Provides fine-grained labels including quality, domain, and toxicity
03. Facilitates targeted data selection for LLM training
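To make the idea of targeted data selection concrete, here is a minimal sketch of filtering annotated records by domain, quality, and toxicity. It is an illustration only: the `Record` fields, the score ranges, the thresholds, and the sample rows are all assumptions made for this example and do not reflect the released dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Record:
    """Hypothetical record layout; the real dataset's field names may differ."""
    text: str
    quality: float   # assumed: higher means better quality
    domain: str      # assumed: a single domain label per text
    toxicity: float  # assumed: higher means more toxic


def select_records(
    records: Iterable[Record],
    domain: str,
    min_quality: float,
    max_toxicity: float,
) -> Iterator[Record]:
    """Yield records in the given domain that pass both thresholds."""
    for record in records:
        if (
            record.domain == domain
            and record.quality >= min_quality
            and record.toxicity <= max_toxicity
        ):
            yield record


# Made-up rows for illustration; not taken from the dataset.
SAMPLE = [
    Record("text a", quality=0.92, domain="medicine", toxicity=0.01),
    Record("text b", quality=0.35, domain="medicine", toxicity=0.02),
    Record("text c", quality=0.88, domain="law", toxicity=0.03),
    Record("text d", quality=0.95, domain="medicine", toxicity=0.80),
]

selected = list(
    select_records(SAMPLE, domain="medicine", min_quality=0.8, max_toxicity=0.1)
)
```

Because the filter is a generator, the same function could in principle be applied to a stream of records rather than an in-memory list, which matters at multi-terabyte scale.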

Abstract

During the development of large language models (LLMs), pre-training data play a critical role in shaping LLMs' capabilities. In recent years, several large-scale and high-quality pre-training datasets have been released to accelerate LLM research, including ChineseWebText1.0, C4, Pile, WanJuan, MAPCC and others. However, as LLMs continue to evolve, the focus has increasingly shifted to domain-specific capabilities and safety concerns, making those earlier coarse-grained texts insufficient for meeting training requirements. Furthermore, fine-grained information, such as quality, domain and toxicity, is becoming increasingly important in building powerful and reliable LLMs for various scenarios. To address these challenges, in this paper, we propose a new tool-chain called MDFG-tool for constructing large-scale and high-quality Chinese datasets with multi-dimensional and fine-grained…

Code & Models

Repositories

casia-lm/chinesewebtext-2.0
PyTorch · Official

Datasets

CASIA-LM/ChineseWebText2.0
dataset · 3.8k downloads

Taxonomy

Topics: Web Data Mining and Analysis · Natural Language Processing Techniques
