Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset

Rabeeh Karimi Mahabadi; Sanjeev Satheesh; Shrimai Prabhumoye; Mostofa Patwary; Mohammad Shoeybi; Bryan Catanzaro

arXiv:2508.15096·cs.CL·August 22, 2025



TL;DR

Nemotron-CC-Math is a large, high-quality mathematical dataset built from web data with a novel extraction pipeline; pretraining language models on it significantly improves their scores on math reasoning benchmarks.

Contribution

Introduces Nemotron-CC-Math, a large-scale, high-quality math dataset built from Common Crawl with a robust, domain-agnostic extraction pipeline that preserves mathematical structure; the resulting dataset outperforms prior datasets.

Findings

01

Pretraining on Nemotron-CC-Math improves math reasoning benchmarks.

02

The dataset surpasses prior open math datasets in size and quality.

03

Models pretrained on it show significant gains in reasoning tasks.

Abstract

Pretraining large language models (LLMs) on high-quality, structured data such as mathematics and code substantially enhances reasoning capabilities. However, existing math-focused datasets built from Common Crawl suffer from degraded quality due to brittle extraction heuristics, lossy HTML-to-text conversion, and the failure to reliably preserve mathematical structure. In this work, we introduce Nemotron-CC-Math, a large-scale, high-quality mathematical corpus constructed from Common Crawl using a novel, domain-agnostic pipeline specifically designed for robust scientific text extraction. Unlike previous efforts, our pipeline recovers math across various formats (e.g., MathJax, KaTeX, MathML) by leveraging layout-aware rendering with lynx and a targeted LLM-based cleaning stage. This approach preserves the structural integrity of equations and code blocks while removing boilerplate,…
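The two stages named in the abstract (layout-aware rendering with lynx, then a cleaning pass) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `build_lynx_command`, `render_with_lynx`, and `extract_page` are hypothetical, the specific lynx flags are an assumption (the flags actually used are not stated here), and the `cleaner` argument is a stand-in for the LLM-based cleaning stage.

```python
import shutil
import subprocess
from typing import Callable, List


def build_lynx_command(width: int = 200) -> List[str]:
    """Build a lynx invocation that renders HTML from stdin to plain text.

    The flag choice is an assumption for illustration:
      -dump        print the rendered page to stdout and exit
      -stdin       read the page from standard input
      -nolist      omit the trailing list of link references
      -force_html  treat the input as HTML
      -width=N     wrap at N columns, so long equations are less likely to break
    """
    return ["lynx", "-dump", "-stdin", "-nolist", "-force_html", f"-width={width}"]


def render_with_lynx(html: str, width: int = 200, timeout: float = 30.0) -> str:
    """Render an HTML string to layout-preserving text using the lynx CLI."""
    if shutil.which("lynx") is None:
        raise RuntimeError("lynx is not installed or not on PATH")
    result = subprocess.run(
        build_lynx_command(width),
        input=html,
        capture_output=True,
        encoding="utf-8",
        timeout=timeout,
        check=True,
    )
    return result.stdout


def extract_page(
    html: str,
    cleaner: Callable[[str], str],
    renderer: Callable[[str], str] = render_with_lynx,
) -> str:
    """Render the page, then hand the text to a cleaning stage.

    `cleaner` is a placeholder for the LLM-based cleaner described in the
    abstract; any callable mapping text to text can be plugged in.
    """
    return cleaner(renderer(html))
```

On a machine with lynx installed, `extract_page(html, cleaner=my_cleaner)` would render the page and pass the text to whatever cleaning function is supplied (`my_cleaner` is hypothetical). Keeping the renderer and cleaner as separate callables also makes it straightforward to swap in a different extractor for a side-by-side comparison on the same pages.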

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- Robust and Proper Pipeline: The lynx + LLM-cleaner pipeline is an effective solution. It addresses the failure mode of previous web math extractors, which corrupt math and code. The qualitative examples provided clearly demonstrate its superiority in preserving structure. While this method makes sense, both the lynx rendering and a 14B LLM cleaner are expensive for many practitioners, so the open sharing of this resource will help the community significantly.
- The paper delivers a dataset that

Weaknesses

This paper would benefit from additional experimental settings. The main experiments are conducted in the mid-training setting. Could the base model itself introduce confounding factors? Further, would a larger number of unique tokens help, and how many times can this dataset be repeated? Readers would also benefit from learning about the filtered-out portion, i.e., Nemotron-cc-math-1-3: its token count, quality, and corresponding model performance may provide valuable information.

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. I think overall the data pipeline is very sound. The authors combine layout-aware lynx rendering with structure-preserving LLM cleaning, avoiding information loss from naïve HTML-to-text extraction.
2. The authors also unify MathJax/KaTeX/MathML into LaTeX, preserving equation and code structure while removing boilerplate, which is very important but often underestimated in previous work.
3. The experiments show very promising results, further demonstrating the quality of the dataset.

Weaknesses

I don't see any obvious weaknesses.

Reviewer 03 · Rating 6 · Confidence 5

Strengths

1. Well-written and well-structured paper with solid experiments.
2. The introduction of lynx, which reliably captures equations and maintains code indentation, avoids heuristic DOM-tree operations such as those used in MegaMath.
3. The ablation on different refinement models is solid.

Weaknesses

I believe that the effectiveness of Lynx should be evaluated through an apples-to-apples comparison. For example, the quality of Lynx versus DOM tree optimization (as introduced in MegaMath) on the same mathematical web pages could be compared under a controlled setting.


Taxonomy

Topics: Mathematics, Computing, and Information Processing · Handwritten Text Recognition Techniques · Natural Language Processing Techniques