CCI3.0-HQ: a large-scale Chinese dataset of high quality designed for pre-training large language models
Liangdong Wang, Bo-Wen Zhang, Chengwei Wu, Hanyu Zhao, Xiaofeng Shi, Shuhao Gu, Jijie Li, Quanyue Ma, TengFei Pan, Guang Liu

TL;DR
This paper introduces CCI3.0-HQ, a large-scale, high-quality Chinese dataset for pre-training large language models. It demonstrates the dataset's effectiveness by training a 0.5B-parameter model that outperforms counterparts trained on CCI3.0, SkyPile, and WanjuanV1 across multiple benchmarks.
Contribution
The paper presents a novel two-stage hybrid filtering pipeline to create a high-quality 500GB Chinese dataset and shows its effectiveness in training a competitive language model.
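As a rough illustration of what a two-stage hybrid filter can look like, the sketch below chains a cheap rule-based pass with a model-based quality score. The summary does not describe the stages in detail, so everything here is an assumption: the function names, the specific rules (minimum length, duplicate-line ratio), the threshold, and the toy scorer that stands in for a trained quality classifier are all hypothetical and are not the authors' pipeline.

```python
from typing import Callable, Iterable, List


def rule_filter(text: str, min_chars: int = 20,
                max_dup_line_ratio: float = 0.3) -> bool:
    """Stage 1 (assumed): cheap heuristic checks on a single document."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    lines = [ln.strip() for ln in stripped.splitlines() if ln.strip()]
    if not lines:
        return False
    # Fraction of non-empty lines that repeat an earlier line.
    dup_ratio = 1.0 - len(set(lines)) / len(lines)
    return dup_ratio <= max_dup_line_ratio


def two_stage_filter(docs: Iterable[str],
                     scorer: Callable[[str], float],
                     threshold: float = 0.5) -> List[str]:
    """Apply the rule-based pass, then keep documents the scorer rates highly."""
    survivors = [d for d in docs if rule_filter(d)]
    # Stage 2 (assumed): a learned quality score, thresholded.
    return [d for d in survivors if scorer(d) >= threshold]


def toy_scorer(text: str) -> float:
    """Hypothetical stand-in for a quality classifier:
    the fraction of characters that are CJK unified ideographs."""
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return cjk / max(len(text), 1)
```

In practice the scorer would be a trained classifier rather than a character-ratio heuristic; the point of the sketch is only the ordering, where inexpensive rules run first so the more costly model sees fewer documents.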
Findings
A 0.5B model trained on CCI3.0-HQ achieves better zero-shot performance across 10 benchmarks than models trained on CCI3.0, SkyPile, and WanjuanV1.
The quality-filtering process distills the capabilities of Qwen2-72B-instruct into a compact 0.5B model.
Open-access dataset facilitates broader access to high-quality Chinese language models.
Abstract
We present CCI3.0-HQ (https://huggingface.co/datasets/BAAI/CCI3-HQ), a high-quality 500GB subset of the Chinese Corpora Internet 3.0 (CCI3.0) (https://huggingface.co/datasets/BAAI/CCI3-Data), developed using a novel two-stage hybrid filtering pipeline that significantly enhances data quality. To evaluate its effectiveness, we trained a 0.5B parameter model from scratch on 100B tokens across various datasets, achieving superior performance on 10 benchmarks in a zero-shot setting compared to CCI3.0, SkyPile, and WanjuanV1. The high-quality filtering process effectively distills the capabilities of the Qwen2-72B-instruct model into a compact 0.5B model, attaining optimal F1 scores for Chinese web data classification. We believe this open-access dataset will facilitate broader access to high-quality language models.
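For readers unfamiliar with the metric, the F1 score mentioned in the abstract is the harmonic mean of precision and recall. The sketch below computes it for a binary "high quality vs. not" labeling. The abstract does not say which F1 variant (binary, macro, weighted) the authors report, so the binary form and the label encoding (1 for high quality) are assumptions made for illustration.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int],
             positive: int = 1) -> float:
    """Binary F1 for the `positive` class; returns 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, with reference labels `[1, 1, 1, 0, 0]` and predictions `[1, 1, 0, 1, 0]`, there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and F1 is 2/3.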
Taxonomy
Topics: Natural Language Processing Techniques
