Blu-WERP (Web Extraction and Refinement Pipeline): A Scalable Pipeline for Preprocessing Large Language Model Datasets
Gowtham, Sai Rupesh, Sanjay Kumar, Saravanan, Venkata Chaithanya

TL;DR
Blu-WERP is a scalable preprocessing pipeline that significantly enhances the quality of web-scraped data for large language models, leading to improved performance across multiple benchmarks and model sizes.
Contribution
Introduces Blu-WERP, a novel, efficient data preprocessing pipeline that outperforms existing methods in cleaning web-scale corpora for LLM training.
Findings
Blu-WERP achieves up to 9.5% improvement over baselines.
It consistently outperforms baselines, including DCLM, across all evaluated model scales (150M to 1B parameters) and benchmarks.
It reduces computational cost while enhancing data quality.
Abstract
High-quality training data is fundamental to large language model (LLM) performance, yet existing preprocessing pipelines often struggle to effectively remove noise and unstructured content from web-scale corpora. This paper presents Blu-WERP, a novel data preprocessing pipeline designed to optimize the quality of Common Crawl WARC files for LLM training. We demonstrate that Blu-WERP significantly outperforms established baselines including DCLM across multiple model scales and evaluation benchmarks. Our pipeline processes CC WARC dumps, implementing advanced filtering and quality assessment mechanisms. We conducted comprehensive evaluations using models with 150M, 400M, 530M, 750M, and 1B parameters, testing against nine standard benchmarks categorized as World Knowledge & Reasoning, Language Understanding, and Commonsense Reasoning. Results show Blu-WERP consistently achieved superior…
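The abstract mentions "filtering and quality assessment mechanisms" without detailing them in the text available here. As a rough illustration of what document-level heuristic filtering of web text can look like, the following is a minimal Python sketch. It is not the Blu-WERP implementation: the function names, the chosen signals, and every threshold are hypothetical assumptions made only for this example.

```python
import re
from collections import Counter

# Hypothetical thresholds chosen for illustration; not taken from the paper.
MIN_WORDS = 50
MAX_SYMBOL_RATIO = 0.10
MAX_DUP_LINE_FRACTION = 0.30


def quality_signals(text: str) -> dict:
    """Compute a few cheap, document-level quality signals."""
    words = text.split()
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    line_counts = Counter(lines)
    # Count every occurrence of a line beyond its first as a duplicate.
    dup_lines = sum(c - 1 for c in line_counts.values())
    # Markup-like characters that often indicate leftover HTML or code debris.
    symbols = len(re.findall(r"[#{}<>|\\]", text))
    return {
        "n_words": len(words),
        "symbol_ratio": symbols / max(len(words), 1),
        "dup_line_fraction": dup_lines / max(len(lines), 1),
    }


def keep_document(text: str) -> bool:
    """Return True if the document passes all (assumed) heuristic checks."""
    s = quality_signals(text)
    return (
        s["n_words"] >= MIN_WORDS
        and s["symbol_ratio"] <= MAX_SYMBOL_RATIO
        and s["dup_line_fraction"] <= MAX_DUP_LINE_FRACTION
    )
```

In a sketch like this, each extracted page would be passed through `keep_document`, and only pages that clear every check would move on to later stages such as deduplication. Real pipelines typically tune such thresholds empirically against downstream benchmark results.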
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Computational and Text Analysis Methods
