CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models
Guang Liu, Liangdong Wang, Jijie Li, Yang Yu, Yao Xu, Jiabei Chen, Yu Bai, Feng Liao, Yonghua Lin

TL;DR
CCI4.0 is a large, high-quality bilingual pretraining dataset designed to improve reasoning in large language models, emphasizing data quality, diverse reasoning patterns, and enhanced downstream task performance.
Contribution
The paper introduces CCI4.0, a novel bilingual dataset with a rigorous data curation pipeline and diverse Chain-of-Thought templates to enhance reasoning capabilities in LLMs.
Findings
LLMs trained on CCI4.0 show improved performance in math and code tasks.
The dataset's quality filtering reduces hallucinations in reasoning templates.
Empirical results highlight the importance of data quality in LLM training.
Abstract
We introduce CCI4.0, a large-scale bilingual pre-training dataset engineered for superior data quality and diverse human-like reasoning trajectories. CCI4.0 occupies roughly TB of disk space and comprises two sub-datasets: CCI4.0-M2-Base and CCI4.0-M2-CoT. CCI4.0-M2-Base combines a TB carefully curated Chinese web corpus, a TB English subset from Nemotron-CC, and diverse sources from math, wiki, arxiv, and code. Although these data are mostly sourced from well-processed datasets, the quality standards of various domains are dynamic and require extensive expert experience and labor to process. So, we propose a novel pipeline that assesses data quality mainly with models, through two-stage deduplication, multiclassifier quality scoring, and domain-aware fluency filtering. We extract billion pieces of CoT (Chain-of-Thought) templates, named CCI4.0-M2-CoT. Differing…
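The abstract names three curation stages (two-stage deduplication, multiclassifier quality scoring, domain-aware fluency filtering) without giving details here. The following is a minimal sketch of how such stages could be chained, not the authors' implementation. Every function name, threshold, and the toy fluency proxy is an assumption made for illustration. In particular, the near-duplicate stage uses a quadratic Jaccard comparison over word shingles, and the fluency score is a simple unique-token ratio standing in for a real language-model-based measure.

```python
import hashlib
from typing import Callable, Dict, List

Doc = Dict[str, str]  # hypothetical record shape: {"text": ..., "domain": ...}


def _shingles(text: str, n: int = 3) -> set:
    """Return the set of word n-grams of a text (whole text if shorter than n)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def _jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(docs: List[Doc], near_dup_threshold: float = 0.8) -> List[Doc]:
    """Assumed two-stage dedup: exact hash match, then near-duplicate removal."""
    # Stage 1: drop exact duplicates after whitespace normalization.
    seen, stage1 = set(), []
    for doc in docs:
        normalized = " ".join(doc["text"].split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            stage1.append(doc)
    # Stage 2: drop near duplicates by shingle Jaccard similarity.
    # Quadratic on purpose for clarity; a large corpus would need MinHash/LSH.
    kept, kept_shingles = [], []
    for doc in stage1:
        sh = _shingles(doc["text"])
        if all(_jaccard(sh, other) < near_dup_threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(sh)
    return kept


def quality_score(text: str, scorers: List[Callable[[str], float]]) -> float:
    """Average the scores of several classifiers (each returns a value in [0, 1])."""
    if not scorers:
        raise ValueError("at least one scorer is required")
    return sum(scorer(text) for scorer in scorers) / len(scorers)


def fluency_score(text: str) -> float:
    """Toy fluency proxy (assumption): fraction of unique tokens."""
    tokens = text.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def curate(
    docs: List[Doc],
    scorers: List[Callable[[str], float]],
    quality_threshold: float,
    fluency_thresholds: Dict[str, float],
    default_fluency: float = 0.5,
) -> List[Doc]:
    """Chain dedup, quality filtering, and a per-domain fluency cutoff."""
    result = []
    for doc in deduplicate(docs):
        if quality_score(doc["text"], scorers) < quality_threshold:
            continue
        limit = fluency_thresholds.get(doc["domain"], default_fluency)
        if fluency_score(doc["text"]) < limit:
            continue
        result.append(doc)
    return result
```

Keeping the fluency cutoff in a per-domain dictionary reflects the "domain-aware" wording: code or math text could plausibly tolerate a different threshold than web prose, though the actual criteria used for CCI4.0 are not stated in this excerpt.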
Taxonomy
Topics: Data Quality and Management · Topic Modeling · Research Data Management Practices
