GneissWeb: Preparing High Quality Data for LLMs at Scale

Hajar Emami Gohari; Swanand Ravindra Kadhe; Syed Yousaf Shah; Constantin Adam; Abdulhamid Adebayo; Praneet Adusumilli; Farhan Ahmed; Nathalie Baracaldo Angel; Santosh Subhashrao Borse; Yuan-Chi Chang; Xuan-Hong Dang; Nirmit Desai; Revital Eres; Ran Iwamoto; Alexei Karve; Yan Koyfman; Wei-Han Lee; Changchang Liu; Boris Lublinsky; Takuyo Ohko; Pablo Pesce; Maroun Touma; Shiqiang Wang; Shalisha Witherspoon; Herbert Woisetschläger; David Wood; Kun-Lung Wu; Issei Yoshida; Syed Zawad; Petros Zerfos; Yi Zhou; and Bishwaranjan Bhattacharjee

arXiv:2502.14907·cs.CL·July 31, 2025

TL;DR

GneissWeb is a large, high-quality dataset of around 10 trillion tokens designed for training LLMs; models trained on it outperform models trained on existing datasets across multiple benchmarks.

Contribution

The paper introduces GneissWeb, a large-scale dataset built with sharded exact substring deduplication and an ensemble of quality filters, intended to support training LLMs at scale.

Findings

1. Models trained on GneissWeb outperform those trained on smaller datasets.

2. GneissWeb achieves a favorable balance between data quality and quantity.

3. Performance improvements are observed across multiple benchmark evaluations.

Abstract

Data quantity and quality play a vital role in determining the performance of Large Language Models (LLMs). High-quality data, in particular, can significantly boost the LLM's ability to generalize on a wide range of downstream tasks. Large pre-training datasets for leading LLMs remain inaccessible to the public, whereas many open datasets are small in size (less than 5 trillion tokens), limiting their suitability for training large models. In this paper, we introduce GneissWeb, a large dataset yielding around 10 trillion tokens that caters to the data quality and quantity requirements of training LLMs. Our GneissWeb recipe that produced the dataset consists of sharded exact sub-string deduplication and a judiciously constructed ensemble of quality filters. GneissWeb achieves a favorable trade-off between data quality and quantity, producing models that outperform models trained on…
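The two stages named in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the function names (`shard`, `dedup_shard`, `keep`), the window length `k`, the three toy filters, and the vote threshold are hypothetical, and a set of sliding token windows stands in for whatever exact-substring index the authors actually use. The sketch only shows the general shape: repeated token spans are removed independently within each shard, and a document is then kept when enough quality filters agree.

```python
from typing import Callable


def shard(docs: list, n_shards: int) -> list[list]:
    """Split a corpus into n_shards round-robin shards (hypothetical scheme)."""
    return [docs[i::n_shards] for i in range(n_shards)]


def dedup_shard(docs: list[list[str]], k: int = 5) -> list[list[str]]:
    """Remove tokens covered by a k-token window already seen in this shard.

    Simplification: a set of k-token windows replaces a real
    exact-substring index, and state is never shared across shards.
    """
    seen: set[tuple[str, ...]] = set()
    cleaned = []
    for tokens in docs:
        drop = [False] * len(tokens)
        for i in range(len(tokens) - k + 1):
            window = tuple(tokens[i:i + k])
            if window in seen:
                # Mark every token of the repeated window for removal.
                for j in range(i, i + k):
                    drop[j] = True
            else:
                seen.add(window)
        cleaned.append([t for t, d in zip(tokens, drop) if not d])
    return cleaned


# Toy quality filters (illustrative stand-ins, not the paper's filters).
def long_enough(doc: str) -> bool:
    return len(doc.split()) >= 8


def readable(doc: str) -> bool:
    words = doc.split()
    if not words:
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    return 3 <= mean_len <= 8


def mostly_alpha(doc: str) -> bool:
    ok = sum(c.isalpha() or c.isspace() for c in doc)
    return ok / max(len(doc), 1) >= 0.9


def keep(doc: str, filters: list[Callable[[str], bool]], min_votes: int) -> bool:
    """Keep a document when at least min_votes filters accept it."""
    return sum(f(doc) for f in filters) >= min_votes
```

In this sketch each shard is deduplicated on its own, so a duplicate that spans two shards survives; that is the cost of sharding in exchange for bounded memory per worker. The vote threshold in `keep` is one simple way to combine filters and is not claimed to be the rule used in the paper.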

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 3

Strengths

The paper claims novelty in terms of several new filtering metrics being applied. These metrics are supported in the context of their training setup with ablations comparing performance against reasonable baselines. The authors are transparent about their methods and clarify all important details about their setup for data curation and training. The paper also touches on fairness/bias, training efficiency, and downstream LLM performance.

Weaknesses

The ablations and comparisons against FineWeb appear thorough and convincing as a case for how the new specific filtering approach improves on the FineWeb recipe. However, the paper leaves unclear where this stands overall as a contribution to pretraining dataset construction. One significant claim from the paper is that other competing filtering methods (like DC-LM Baseline) are more aggressive in quality filtering, and thus GneissWeb has the advantage of creating a larger pretraining dataset.

Reviewer 02·Rating 4·Confidence 3

Strengths

- The experimental results show clear improvements at multiple scales. Consistent gains at 1.4B/3B/7B on both high-signal and extended suites.
- This paper introduces a practical and scalable pipeline. Sharded exact substring dedup and ensemble filtering implemented with an open Data Prep Kit. The Bloom-filter reproduction path is pragmatic for large-scale users.
- Reporting FLOP reductions to a fixed quality target can be helpful.

Weaknesses

- While each component is tested separately, the interactions among filters (and the precise contribution of threshold tuning) are not clearly separated. There is also a risk of overfitting the filtering thresholds to the evaluation sets.
- It seems that the pipeline leans on readability heuristics and tokenization extremes. There is limited analysis of the false-positive and false-negative rates and their downstream semantic impact (e.g., removal of niche yet valuable domains).
- Many of the da

Reviewer 03·Rating 6·Confidence 3

Strengths

1. Powerful empirical results across the benchmarks. I really like the scaling plots. Tell a very clear story.
2. The mixing of multiple quality filters is a nice paradigm.
3. The evaluations were particularly rigorous.
4. Good statistical rigor.

Weaknesses

None really. Small things:
1. Could have discussed the computational cost of experiments.
2. Starting from FineWeb is good, but maybe could have ablated different sources.
3. Math/code performance is ok, but not stellar.

Taxonomy

Methods·Sparse Evolutionary Training