Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset
Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro

TL;DR
This paper introduces Nemotron-CC, a 6.3-trillion-token English Common Crawl dataset refined for long-horizon language model pretraining. It improves benchmark results over DCLM and supports state-of-the-art training over a 15-trillion-token horizon.
Contribution
It combines classifier ensembling, synthetic data rephrasing, and reduced reliance on heuristic filters to build a large, high-quality dataset suited to long token horizon training, with a high-quality subset that outperforms prior datasets such as DCLM on MMLU.
Findings
Training 8B parameter models for 1T tokens on the high-quality subset improves MMLU by 5.6 over DCLM.
The full 6.3T token dataset matches DCLM on MMLU while containing four times more unique real tokens.
Training for 15T tokens with Nemotron-CC data surpasses the Llama 3.1 8B model on multiple benchmarks.
Abstract
Recent English Common Crawl datasets like FineWeb-Edu and DCLM achieved significant benchmark gains via aggressive model-based filtering, but at the cost of removing 90% of data. This limits their suitability for long token horizon training, such as 15T tokens for Llama 3.1. In this paper, we show how to achieve better trade-offs between accuracy and data quantity by a combination of classifier ensembling, synthetic data rephrasing, and reduced reliance on heuristic filters. When training 8B parameter models for 1T tokens, using a high-quality subset of our data improves MMLU by 5.6 over DCLM, demonstrating the efficacy of our methods for boosting accuracies over a relatively short token horizon. Furthermore, our full 6.3T token dataset matches DCLM on MMLU, but contains four times more unique real tokens than DCLM. This unlocks state-of-the-art training over a long token horizon: an 8B…
Code & Models
- 🤗 nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF · model · 18k downloads · ♡ 108
- 🤗 nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 · model · 47k downloads · ♡ 67
- 🤗 nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8 · model · 10k downloads · ♡ 19
- 🤗 nvidia/NVIDIA-Nemotron-Nano-9B-v2 · model · 429k downloads · ♡ 487
- 🤗 unsloth/NVIDIA-Nemotron-3-Nano-4B-GGUF · model · 26k downloads · ♡ 51
- 🤗 nvidia/NVIDIA-Nemotron-Nano-12B-v2 · model · 30k downloads · ♡ 161
- 🤗 cpagac/Nemotron-Nano-9B-v2-heretic · model · 278 downloads · ♡ 3
- 🤗 cyankiwi/NVIDIA-Nemotron-Nano-9B-v2-AWQ-4bit · model · 389 downloads · ♡ 3
- 🤗 unsloth/NVIDIA-Nemotron-Nano-9B-v2 · model · 617 downloads · ♡ 3
- 🤗 tokyotech-llm/Qwen3-Swallow-8B-RL-v0.2 · model · 4.8k downloads · ♡ 9
Taxonomy
Topics: Image Processing and 3D Reconstruction · Semantic Web and Ontologies · Natural Language Processing Techniques
Methods: LLaMA
