CCI3.0-HQ: a large-scale Chinese dataset of high quality designed for pre-training large language models

Liangdong Wang; Bo-Wen Zhang; Chengwei Wu; Hanyu Zhao; Xiaofeng Shi; Shuhao Gu; Jijie Li; Quanyue Ma; TengFei Pan; Guang Liu

arXiv:2410.18505·cs.CL·October 28, 2024

Open Access · 1 Model · 2 Datasets

TL;DR

This paper introduces CCI3.0-HQ, a large-scale, high-quality Chinese dataset designed for pre-training large language models. A 0.5B-parameter model trained on it outperforms models trained on previous datasets (CCI3.0, SkyPile, and WanjuanV1) on multiple benchmarks.

Contribution

The paper presents a novel two-stage hybrid filtering pipeline to create a high-quality 500GB Chinese dataset and shows its effectiveness in training a competitive language model.

Findings

01

A 0.5B model trained on CCI3.0-HQ outperforms models trained on CCI3.0, SkyPile, and WanjuanV1 on 10 benchmarks in a zero-shot setting.

02

The filtering process distills the capabilities of Qwen2-72B-instruct into a compact 0.5B model for Chinese web data classification.

03

The open-access dataset is intended to facilitate broader access to high-quality Chinese language models.

Abstract

We present CCI3.0-HQ (https://huggingface.co/datasets/BAAI/CCI3-HQ), a high-quality 500GB subset of the Chinese Corpora Internet 3.0 (CCI3.0) (https://huggingface.co/datasets/BAAI/CCI3-Data), developed using a novel two-stage hybrid filtering pipeline that significantly enhances data quality. To evaluate its effectiveness, we trained a 0.5B parameter model from scratch on 100B tokens across various datasets, achieving superior performance on 10 benchmarks in a zero-shot setting compared to CCI3.0, SkyPile, and WanjuanV1. The high-quality filtering process effectively distills the capabilities of the Qwen2-72B-instruct model into a compact 0.5B model, attaining optimal F1 scores for Chinese web data classification. We believe this open-access dataset will facilitate broader access to high-quality language models.
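
The abstract does not spell out the two stages of the filtering pipeline, so the following is only a minimal sketch of the general idea, assuming a cheap rule-based first stage followed by a model-score threshold in the second. Every name here (`passes_basic_rules`, `two_stage_filter`, `toy_score`), the thresholds, and the heuristics are hypothetical illustrations and are not taken from the paper. The `f1_score` helper shows how a binary quality classifier could be compared against reference labels, which is the kind of metric the abstract reports.

```python
from typing import Callable, Iterable, List, Sequence


def passes_basic_rules(text: str, min_chars: int = 20,
                       max_repeat_ratio: float = 0.5) -> bool:
    """Stage 1 (assumed): drop very short or highly repetitive text."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    # Share of the document taken up by its single most frequent character.
    most_common = max(stripped.count(ch) for ch in set(stripped))
    return most_common / len(stripped) <= max_repeat_ratio


def two_stage_filter(docs: Iterable[str],
                     score_fn: Callable[[str], float],
                     threshold: float = 0.5) -> List[str]:
    """Keep documents that pass stage 1 and score at least `threshold` in stage 2."""
    kept = []
    for doc in docs:
        if not passes_basic_rules(doc):
            continue  # rejected by the cheap rules, never scored
        if score_fn(doc) >= threshold:
            kept.append(doc)
    return kept


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for the positive (high-quality) class; 0.0 if no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def toy_score(text: str) -> float:
    """Placeholder scorer (fraction of distinct characters). NOT a real quality model."""
    stripped = text.strip()
    if not stripped:
        return 0.0
    return len(set(stripped)) / len(stripped)
```

In a real setting, `score_fn` would wrap a trained quality classifier rather than `toy_score`; the point of the sketch is only that the inexpensive rules run first so the more costly scorer sees fewer documents.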

Code & Models

Models

BAAI/CCI3-HQ-Intermediate-Checkpoints

Datasets

BAAI/CCI3-HQ
BAAI/CCI3-Data

Taxonomy

Topics: Natural Language Processing Techniques