GneissWeb: Preparing High Quality Data for LLMs at Scale
Hajar Emami Gohari, Swanand Ravindra Kadhe, Syed Yousaf Shah, Constantin Adam, Abdulhamid Adebayo, Praneet Adusumilli, Farhan Ahmed, Nathalie Baracaldo Angel, Santosh Subhashrao Borse, Yuan-Chi Chang, Xuan-Hong Dang, Nirmit Desai, Revital Eres, Ran Iwamoto, Alexei Karve

TL;DR
GneissWeb is a large, high-quality dataset of around 10 trillion tokens designed for training LLMs, outperforming existing datasets in model performance across multiple benchmarks.
Contribution
The paper introduces GneissWeb, a large-scale dataset of around 10 trillion tokens built with sharded exact substring deduplication and an ensemble of quality filters, intended for training LLMs at scale.
Findings
Models trained on GneissWeb outperform those trained on smaller datasets.
GneissWeb achieves a favorable balance between data quality and quantity.
Performance improvements are observed across multiple benchmark evaluations.
Abstract
Data quantity and quality play a vital role in determining the performance of Large Language Models (LLMs). High-quality data, in particular, can significantly boost the LLM's ability to generalize on a wide range of downstream tasks. Large pre-training datasets for leading LLMs remain inaccessible to the public, whereas many open datasets are small in size (less than 5 trillion tokens), limiting their suitability for training large models. In this paper, we introduce GneissWeb, a large dataset yielding around 10 trillion tokens that caters to the data quality and quantity requirements of training LLMs. Our GneissWeb recipe that produced the dataset consists of sharded exact sub-string deduplication and a judiciously constructed ensemble of quality filters. GneissWeb achieves a favorable trade-off between data quality and quantity, producing models that outperform models trained on…
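As a rough illustration of what "sharded exact sub-string deduplication" can look like, the sketch below splits a corpus into shards by document id and, within each shard, removes any token span that repeats a window already seen in an earlier document. This is not the GneissWeb implementation: production exact-substring deduplication commonly relies on suffix arrays, whereas this sketch substitutes a plain set of sliding token windows. The names `shard_of`, `dedup_shard`, `dedup_corpus`, and the `WINDOW` length are hypothetical.

```python
import hashlib
from collections import defaultdict

WINDOW = 5  # hypothetical minimum duplicate span, in whitespace tokens


def shard_of(doc_id: str, num_shards: int) -> int:
    """Assign a document to a shard by hashing its id (stable across runs)."""
    digest = hashlib.sha1(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def dedup_shard(docs: dict, window: int = WINDOW) -> dict:
    """Drop tokens covered by any window seen in an earlier doc of this shard."""
    seen = set()
    cleaned = {}
    for doc_id in sorted(docs):  # deterministic processing order
        tokens = docs[doc_id].split()
        drop = [False] * len(tokens)
        windows = [
            tuple(tokens[i:i + window])
            for i in range(len(tokens) - window + 1)
        ]
        for i, w in enumerate(windows):
            if w in seen:  # exact match against earlier documents only
                for j in range(i, i + window):
                    drop[j] = True
        seen.update(windows)  # repeats inside one document are not removed
        cleaned[doc_id] = " ".join(t for t, d in zip(tokens, drop) if not d)
    return cleaned


def dedup_corpus(docs: dict, num_shards: int = 2, window: int = WINDOW) -> dict:
    """Deduplicate each shard independently; duplicates across shards survive."""
    shards = defaultdict(dict)
    for doc_id, text in docs.items():
        shards[shard_of(doc_id, num_shards)][doc_id] = text
    out = {}
    for shard_docs in shards.values():
        out.update(dedup_shard(shard_docs, window))
    return out
```

Sharding bounds the memory needed per worker at the cost of missing duplicates that fall into different shards, which is one way to trade recall for scalability.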
Peer Reviews
Decision · ICLR 2026 Poster
The paper claims novelty in terms of several new filtering metrics being applied. These metrics are supported in the context of their training setup with ablations comparing performance against reasonable baselines. The authors are transparent about their methods and clarify all important details about their setup for data curation and training. The paper also touches on fairness/bias, training efficiency, and downstream LLM performance.
The ablations and comparisons against FineWeb appear thorough and convincing as a case for how the new specific filtering approach improves on the FineWeb recipe. However, the paper leaves unclear where this stands overall as a contribution to pretraining dataset construction. One significant claim from the paper is that other competing filtering methods (like DC-LM Baseline) are more aggressive in quality filtering, and thus GneissWeb has the advantage of creating a larger pretraining dataset.
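The quality-versus-quantity point can be made concrete with a toy comparison between one strict filter and an ensemble rule. Everything in this sketch is an assumption for illustration: the score fields, the thresholds, and the combining rule are invented and are not GneissWeb's or DCLM's actual filters.

```python
from dataclasses import dataclass


@dataclass
class DocScores:
    classifier: float       # hypothetical quality-classifier probability in [0, 1]
    readability: float      # hypothetical readability grade; lower reads easier
    tokens_per_char: float  # tokenizer output length divided by character length


def single_filter(d: DocScores, clf_min: float = 0.9) -> bool:
    """Aggressive baseline: keep only documents with a very high classifier score."""
    return d.classifier >= clf_min


def ensemble_filter(
    d: DocScores,
    clf_min: float = 0.5,
    read_max: float = 12.0,
    tpc_min: float = 0.1,
    tpc_max: float = 0.5,
) -> bool:
    """Keep a document if tokenization is not extreme AND either signal passes."""
    if not (tpc_min <= d.tokens_per_char <= tpc_max):
        return False  # extreme tokenization rejects regardless of other scores
    return d.classifier >= clf_min or d.readability <= read_max


def retention(docs, keep) -> float:
    """Fraction of documents that a filter keeps."""
    return sum(keep(d) for d in docs) / len(docs)


sample = [
    DocScores(0.95, 10.0, 0.25),  # passes both rules
    DocScores(0.60, 15.0, 0.25),  # ensemble keeps it via the classifier
    DocScores(0.30, 8.0, 0.25),   # ensemble keeps it via readability
    DocScores(0.97, 9.0, 0.90),   # strict filter keeps it; ensemble rejects it
    DocScores(0.20, 16.0, 0.30),  # rejected by both
]
```

On this sample the ensemble keeps more documents than the strict single filter while also rejecting one document the strict filter would have kept, which is the kind of trade-off the review describes.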
- The experimental results show clear improvements at multiple scales, with consistent gains at 1.4B/3B/7B on both high-signal and extended suites.
- The paper introduces a practical and scalable pipeline: sharded exact substring deduplication and ensemble filtering, implemented with an open Data Prep Kit. The Bloom-filter reproduction path is pragmatic for large-scale users.
- Reporting FLOP reductions to a fixed quality target can be helpful.
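The review does not spell out how the Bloom-filter reproduction path works. One plausible reading, assumed here, is that a filter built over the identifiers of retained documents lets a user select the same subset from a parent corpus without redistributing any text. The sketch below is a generic Bloom filter written from scratch; `BloomFilter` and `select_subset` are hypothetical names and do not reflect the Data Prep Kit API.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two 64-bit hash values.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item)
        )


def select_subset(parent_docs: dict, bloom: BloomFilter) -> dict:
    """Keep parent-corpus documents whose id is (probably) in the filter."""
    return {
        doc_id: text for doc_id, text in parent_docs.items() if doc_id in bloom
    }
```

Because a Bloom filter can return false positives, a subset reproduced this way may contain a small number of extra documents; the filter size and hash count control how many.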
- While each component is tested separately, the interactions among filters (and the precise contribution of threshold tuning) are not clearly separated. There is also a risk of overfitting the filtering thresholds to the evaluation sets.
- It seems that the pipeline leans on readability heuristics and tokenization extremes. There is limited analysis of the false-positive and false-negative rates and their downstream semantic impact (e.g., removal of niche yet valuable domains).
- Many of the da
1. Powerful empirical results across the benchmarks. I really like the scaling plots; they tell a very clear story.
2. The mixing of multiple quality filters is a nice paradigm.
3. The evaluations were particularly rigorous.
4. Good statistical rigor.
None really. Small things:
1. Could have discussed the computational cost of experiments.
2. Starting from FineWeb is good, but maybe could have ablated different sources.
3. Math/code performance is ok, but not stellar.
Code & Models
- 🤗 ibm-granite/GneissWeb.7B_ablation_model_on_350B_FineWeb.seed1 · model · 21 dl
- 🤗 ibm-granite/GneissWeb.7B_ablation_model_on_350B_GneissWeb.seed1 · model · 12 dl
- 🤗 ibm-granite/GneissWeb.7B_ablation_model_on_350B_FineWeb.Edu.seed1 · model · 18 dl
- 🤗 ibm-granite/GneissWeb.7B_ablation_model_on_350B_FineWeb.seed2 · model · 15 dl
- 🤗 ibm-granite/GneissWeb.7B_ablation_model_on_350B_GneissWeb.seed2 · model · 18 dl
- 🤗 ibm-granite/GneissWeb.7B_ablation_model_on_350B_FineWeb.Edu.seed2 · model · 19 dl
- 🤗 ibm-granite/GneissWeb.7B_ablation_model_on_350B_FineWeb.seed3 · model · 13 dl
- 🤗 ibm-granite/GneissWeb.7B_ablation_model_on_350B_GneissWeb.seed3 · model · 13 dl
- 🤗 ibm-granite/GneissWeb.7B_ablation_model_on_350B_FineWeb.Edu.seed3 · model · 17 dl
Taxonomy
Methods: Sparse Evolutionary Training
