ChineseWebText: Large-scale High-quality Chinese Web Text Extracted with Effective Evaluation Model

Jianghao Chen; Pu Jian; Tengxiao Xi; Dongyi Yi; Qianlong Du; Chenglin Ding; Guibo Zhu; Chengqing Zong; Jinqiao Wang; Jiajun Zhang

arXiv:2311.01149·cs.CL·November 13, 2023·1 cites

TL;DR

This paper introduces EvalWeb, a comprehensive tool-chain for extracting and evaluating high-quality Chinese web texts, resulting in the large-scale ChineseWebText dataset to improve LLM pre-training.

Contribution

The paper presents a complete tool-chain for extracting and assessing Chinese web data, including a quality evaluation model, and releases what the authors describe as the largest high-quality Chinese web text dataset.

Findings

01 Releases 1.42 TB ChineseWebText dataset with quality scores

02 Provides a cleaner 600 GB subset with quality > 90%

03 Facilitates better data selection for Chinese LLM pre-training
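
As a rough illustration of how per-text quality scores could drive data selection, the sketch below keeps only records whose score exceeds a threshold. The JSON Lines layout and the field names `text` and `score` are assumptions made for this example, not a confirmed description of the released files. The default threshold of 0.9 simply mirrors the "quality > 90%" subset listed above.

```python
import json
from typing import Iterable, Iterator


def select_high_quality(lines: Iterable[str], threshold: float = 0.9) -> Iterator[dict]:
    """Yield records whose quality score is strictly above the threshold.

    Assumes each line is a JSON object with hypothetical fields
    "text" and "score" (score in [0, 1]).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Records without a score are treated as lowest quality.
        if record.get("score", 0.0) > threshold:
            yield record


# Toy in-memory sample standing in for a scored data file.
sample = [
    '{"text": "高质量的中文文本。", "score": 0.97}',
    '{"text": "点击这里 广告 广告", "score": 0.12}',
    '{"text": "一段普通的网页内容。", "score": 0.90}',
]
kept = list(select_high_quality(sample))
print([r["score"] for r in kept])
```

Because the comparison is strict, a record scored exactly 0.90 is dropped. Lowering `threshold` trades cleanliness for volume, which is the kind of choice per-text scores are meant to enable.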

Abstract

During the development of large language models (LLMs), the scale and quality of the pre-training data play a crucial role in shaping LLMs' capabilities. To accelerate LLM research, several large-scale datasets, such as C4 [1], Pile [2], RefinedWeb [3] and WanJuan [4], have been released to the public. However, most of the released corpora focus mainly on English, and a complete tool-chain for extracting clean texts from web data is still lacking. Furthermore, fine-grained information about the corpora, e.g. the quality of each text, is missing. To address these challenges, we propose a new complete tool-chain, EvalWeb, to extract clean Chinese texts from noisy web data. First, similar to previous work, manually crafted rules are employed to discard explicitly noisy texts from the raw crawled web contents. Second, a well-designed evaluation model is leveraged to assess…
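
The two stages named in the abstract (rule-based filtering, then model-based assessment) can be sketched as follows. This is a minimal illustration under stated assumptions, not the EvalWeb implementation: the rules in `passes_rules` (a minimum length and a minimum share of Chinese characters) and the `toy_scorer` function are hypothetical stand-ins, since the actual rules and evaluation model are not described in this excerpt.

```python
import re
from typing import Callable, Iterable, List, Tuple

# Matches characters in the CJK Unified Ideographs block.
CJK = re.compile(r"[\u4e00-\u9fff]")


def passes_rules(text: str, min_chars: int = 10, min_cjk_ratio: float = 0.5) -> bool:
    """Hypothetical stage-1 rules: minimum length and share of Chinese characters."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    cjk_count = len(CJK.findall(stripped))
    return cjk_count / len(stripped) >= min_cjk_ratio


def toy_scorer(text: str) -> float:
    """Placeholder for a learned evaluation model (assumption, not the paper's model)."""
    return 1.0 if text.strip().endswith(("。", "！", "？")) else 0.5


def run_pipeline(
    texts: Iterable[str], scorer: Callable[[str], float]
) -> List[Tuple[str, float]]:
    """Stage 1 drops rule violations; stage 2 attaches a quality score to the rest."""
    return [(t, scorer(t)) for t in texts if passes_rules(t)]


raw = [
    "大规模语言模型的预训练依赖高质量的数据。",
    "Click here to subscribe now!!!",
    "广告",
    "这是一段没有标点结尾的中文网页文本内容",
]
results = run_pipeline(raw, toy_scorer)
for text, score in results:
    print(f"{score:.1f}  {text}")
```

Keeping the scorer as a plain callable means a trained classifier could be swapped in for `toy_scorer` without touching the rule stage.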

Code & Models

Repositories

casia-lm/chinesewebtext
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Web Data Mining and Analysis · Topic Modeling