CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models

Guang Liu; Liangdong Wang; Jijie Li; Yang Yu; Yao Xu; Jiabei Chen; Yu Bai; Feng Liao; Yonghua Lin

arXiv:2506.07463 · cs.CL · June 10, 2025

Open Access 3 Datasets

TL;DR

CCI4.0 is a large, high-quality bilingual pretraining dataset designed to improve reasoning in large language models, emphasizing data quality, diverse reasoning patterns, and enhanced downstream task performance.

Contribution

The paper introduces CCI4.0, a novel bilingual dataset with a rigorous data curation pipeline and diverse Chain-of-Thought templates to enhance reasoning capabilities in LLMs.

Findings

01. LLMs trained on CCI4.0 show improved performance in math and code tasks.

02. The dataset's quality filtering reduces hallucinations in reasoning templates.

03. Empirical results highlight the importance of data quality in LLM training.

Abstract

We introduce CCI4.0, a large-scale bilingual pre-training dataset engineered for superior data quality and diverse human-like reasoning trajectories. CCI4.0 occupies roughly 35 TB of disk space and comprises two sub-datasets: CCI4.0-M2-Base and CCI4.0-M2-CoT. CCI4.0-M2-Base combines a 5.2 TB carefully curated Chinese web corpus, a 22.5 TB English subset from Nemotron-CC, and diverse sources from math, wiki, arXiv, and code. Although these data are mostly sourced from well-processed datasets, the quality standards of the various domains are dynamic, and processing the data requires extensive expert experience and labor. We therefore propose a novel pipeline that assesses data quality mainly with models, through two-stage deduplication, multi-classifier quality scoring, and domain-aware fluency filtering. We extract 4.5 billion Chain-of-Thought (CoT) templates, named CCI4.0-M2-CoT. Differing…
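The three model-based steps named in the abstract (two-stage deduplication, multi-classifier quality scoring, domain-aware fluency filtering) can be illustrated with a minimal Python sketch. This is not the CCI4.0 implementation: the abstract excerpt does not specify the stages, classifiers, or thresholds, so the exact-then-near split of deduplication, the averaging of scorer outputs, the per-domain fluency thresholds, and every function name below are assumptions made for illustration.

```python
import hashlib
import re
from typing import Callable, Dict, List, Set

# Hypothetical sketch only: names, thresholds, and the stage split are
# assumptions, not details taken from the CCI4.0 paper.


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def exact_dedup(docs: List[str]) -> List[str]:
    """Assumed stage 1: drop documents whose normalized text was already seen."""
    seen: Set[str] = set()
    kept: List[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text: str, n: int = 3) -> Set[str]:
    """Set of word n-grams; texts shorter than n words yield one shingle."""
    words = normalize(text).split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_dedup(docs: List[str], threshold: float = 0.8) -> List[str]:
    """Assumed stage 2: drop a document too similar to one already kept.

    Pairwise comparison is quadratic and only suitable for a toy corpus;
    a large corpus would need an approximate method such as MinHash.
    """
    kept: List[str] = []
    kept_shingles: List[Set[str]] = []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


def mean_quality(doc: str, scorers: List[Callable[[str], float]]) -> float:
    """Combine several quality classifiers by averaging their scores."""
    return sum(score(doc) for score in scorers) / len(scorers)


def filter_corpus(
    docs: List[str],
    scorers: List[Callable[[str], float]],
    fluency: Callable[[str], float],
    domain_of: Callable[[str], str],
    quality_min: float,
    fluency_min: Dict[str, float],
) -> List[str]:
    """Deduplicate, then keep documents passing quality and fluency checks."""
    kept: List[str] = []
    for doc in near_dedup(exact_dedup(docs)):
        if mean_quality(doc, scorers) < quality_min:
            continue
        # The fluency threshold is looked up per domain (e.g. web vs. code).
        if fluency(doc) < fluency_min[domain_of(doc)]:
            continue
        kept.append(doc)
    return kept
```

In this sketch the scorers, the fluency function, and the domain tagger are passed in as plain callables, so trained classifiers or a perplexity-based fluency measure could be substituted without changing the control flow.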

Taxonomy

Topics: Data Quality and Management · Topic Modeling · Research Data Management Practices