Blu-WERP (Web Extraction and Refinement Pipeline): A Scalable Pipeline for Preprocessing Large Language Model Datasets

Gowtham; Sai Rupesh; Sanjay Kumar; Saravanan; Venkata Chaithanya

arXiv:2511.18054 · cs.CL · December 4, 2025

TL;DR

Blu-WERP is a scalable preprocessing pipeline that significantly enhances the quality of web-scraped data for large language models, leading to improved performance across multiple benchmarks and model sizes.

Contribution

Introduces Blu-WERP, a novel, efficient data preprocessing pipeline that outperforms existing methods in cleaning web-scale corpora for LLM training.

Findings

01 Blu-WERP achieves up to a 9.5% improvement over baselines.

02 Blu-WERP consistently outperforms the baselines across all model scales and benchmarks.

03 Blu-WERP reduces computational cost while enhancing data quality.

Abstract

High-quality training data is fundamental to large language model (LLM) performance, yet existing preprocessing pipelines often struggle to effectively remove noise and unstructured content from web-scale corpora. This paper presents Blu-WERP, a novel data preprocessing pipeline designed to optimize the quality of Common Crawl WARC files for LLM training. We demonstrate that Blu-WERP significantly outperforms established baselines including DCLM across multiple model scales and evaluation benchmarks. Our pipeline processes CC WARC dumps, implementing advanced filtering and quality assessment mechanisms. We conducted comprehensive evaluations using models with 150M, 400M, 530M, 750M, and 1B parameters, testing against nine standard benchmarks categorized as World Knowledge & Reasoning, Language Understanding, and Commonsense Reasoning. Results show Blu-WERP consistently achieved superior…
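The abstract does not spell out the filtering and quality assessment steps, so the following is only a minimal sketch of the general kind of heuristic filtering and exact deduplication that web-text pipelines commonly apply to text already extracted from WARC records. It is not Blu-WERP's method: the function names `passes_quality` and `filter_documents`, and every threshold, are hypothetical choices made for illustration.

```python
import hashlib
import re


def passes_quality(text, min_words=50, max_symbol_ratio=0.1,
                   min_mean_word_len=3.0, max_mean_word_len=10.0):
    """Hypothetical heuristic filter; thresholds are illustrative assumptions."""
    words = text.split()
    # Very short pages are usually navigation stubs or error pages.
    if len(words) < min_words:
        return False
    # Implausible mean word length suggests markup debris or token soup.
    mean_len = sum(len(w) for w in words) / len(words)
    if not (min_mean_word_len <= mean_len <= max_mean_word_len):
        return False
    # Too many non-alphanumeric, non-space characters suggests noise.
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio


def filter_documents(docs, **thresholds):
    """Keep documents that pass the heuristics, dropping exact duplicates.

    Duplicates are detected after whitespace and case normalization.
    """
    seen = set()
    kept = []
    for text in docs:
        normalized = re.sub(r"\s+", " ", text).strip().lower()
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        if passes_quality(text, **thresholds):
            kept.append(text)
    return kept
```

A production pipeline would typically add language identification, near-duplicate detection, and model-based quality scoring on top of rules like these; the sketch only shows how simple rule-based checks compose.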

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Computational and Text Analysis Methods