ChineseWebText: Large-scale High-quality Chinese Web Text Extracted with Effective Evaluation Model
Jianghao Chen, Pu Jian, Tengxiao Xi, Dongyi Yi, Qianlong Du, Chenglin Ding, Guibo Zhu, Chengqing Zong, Jinqiao Wang, Jiajun Zhang

TL;DR
This paper introduces EvalWeb, a comprehensive tool-chain for extracting and evaluating high-quality Chinese web texts, resulting in the large-scale ChineseWebText dataset to improve LLM pre-training.
Contribution
The paper presents a novel complete tool-chain for extracting and assessing Chinese web data, including a quality evaluation model, and releases the largest high-quality Chinese web text dataset.
Findings
Releases 1.42 TB ChineseWebText dataset with quality scores
Provides a cleaner 600 GB subset with quality > 90%
Facilitates better data selection for Chinese LLM pre-training
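Because each text is released with a quality score, choosing a cleaner subset comes down to a threshold filter. The sketch below is illustrative only: the JSON Lines layout, the field names `text` and `score`, and the 0 to 1 score range are assumptions made for this example, not the documented format of the released dataset.

```python
import json
from typing import Iterable, Iterator


def select_by_quality(lines: Iterable[str], threshold: float = 0.9) -> Iterator[dict]:
    """Yield records whose quality score is strictly above `threshold`.

    Assumes one JSON object per line with hypothetical fields
    "text" and "score" (score in [0, 1]).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Records without a score are treated as lowest quality.
        if record.get("score", 0.0) > threshold:
            yield record


# Tiny in-memory sample standing in for a dataset shard.
sample = [
    '{"text": "高质量的中文网页文本示例。", "score": 0.95}',
    '{"text": "点击这里 登录 注册 广告", "score": 0.42}',
    '{"text": "刚好位于阈值上的文本。", "score": 0.90}',
]

kept = list(select_by_quality(sample, threshold=0.9))
print(len(kept), "of", len(sample), "records kept")
```

Raising or lowering `threshold` trades corpus size against cleanliness, which is the kind of selection the per-text scores are meant to support.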
Abstract
During the development of large language models (LLMs), the scale and quality of the pre-training data play a crucial role in shaping LLMs' capabilities. To accelerate research on LLMs, several large-scale datasets, such as C4 [1], Pile [2], RefinedWeb [3] and WanJuan [4], have been released to the public. However, most of the released corpora focus mainly on English, and there is still a lack of a complete tool-chain for extracting clean texts from web data. Furthermore, fine-grained information about the corpus, e.g. the quality of each text, is missing. To address these challenges, we propose in this paper a new complete tool-chain, EvalWeb, to extract clean Chinese texts from noisy web data. First, similar to previous work, manually crafted rules are employed to discard explicitly noisy texts from the raw crawled web contents. Second, a well-designed evaluation model is leveraged to assess…
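The two stages named in the abstract (hand-written rules first, then a model-based quality score) can be sketched as a small pipeline. Everything concrete here is an assumption for illustration: the length and Chinese-character-ratio rules, their thresholds, and `toy_score` are hypothetical stand-ins, not EvalWeb's actual rules or evaluation model.

```python
import re
from typing import Callable, Iterable, List, Tuple

# Basic CJK Unified Ideographs block; a rough proxy for "Chinese character".
CJK = re.compile(r"[\u4e00-\u9fff]")


def passes_rules(text: str, min_chars: int = 20, min_cjk_ratio: float = 0.3) -> bool:
    """Stage 1 (hypothetical rules): drop very short or mostly non-Chinese texts."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    cjk_count = len(CJK.findall(stripped))
    return cjk_count / len(stripped) >= min_cjk_ratio


def extract_clean_texts(
    raw_texts: Iterable[str],
    score_fn: Callable[[str], float],
) -> List[Tuple[str, float]]:
    """Stage 2: attach a quality score to every text that survived the rules."""
    return [(t, score_fn(t)) for t in raw_texts if passes_rules(t)]


def toy_score(text: str) -> float:
    """Placeholder scorer (NOT the paper's model): share of distinct characters."""
    return len(set(text)) / len(text)


docs = [
    "短文本",
    "这是一个用于演示的中文句子，它的长度足以通过最小长度规则的检查。",
    "click here to login and subscribe to our newsletter today",
]

results = extract_clean_texts(docs, toy_score)
for text, score in results:
    print(f"{score:.2f}  {text}")
```

Passing the scorer in as a callable keeps the rule stage independent of the model, so a trained classifier could replace `toy_score` without touching the filtering code.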
Taxonomy
Topics: Natural Language Processing Techniques · Web Data Mining and Analysis · Topic Modeling
