Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs

Sai Krishna Mendu; Harish Yenala; Aditi Gulati; Shanu Kumar; Parag Agrawal

arXiv:2505.02009·cs.CL·August 14, 2025



TL;DR

This paper analyzes harmful content in large web datasets used for pretraining language models, introduces a taxonomy and filtering models, and provides benchmarks to promote safer and more responsible LLM development.

Contribution

The paper offers a comprehensive taxonomy of harmful content, introduces HarmFormer for filtering, and creates benchmarks such as HAVOC to improve safety in LLM pretraining.

Findings

01. HarmFormer effectively filters harmful content from datasets.

02. The HAVOC benchmark assesses model responses to toxic inputs.

03. Analysis reveals a significant presence of harmful content in web datasets.

Abstract

Large language models (LLMs) have become integral to various real-world applications, leveraging massive, web-sourced datasets like Common Crawl, C4, and FineWeb for pretraining. While these datasets provide linguistic data essential for high-quality natural language generation, they often contain harmful content, such as hate speech, misinformation, and biased narratives. Training LLMs on such unfiltered data risks perpetuating toxic behaviors, spreading misinformation, and amplifying societal biases, which can undermine trust in LLM-driven applications and raise ethical concerns about their use. This paper presents a large-scale analysis of inappropriate content across these datasets, offering a comprehensive taxonomy that categorizes harmful webpages into Topical and Toxic based on their intent. We also introduce a prompt evaluation dataset, a high-accuracy Topical and Toxic Prompt…


Code & Models

Repositories

themendu/TowardsSaferPretraining (official)

Models

themendu/HarmFormer (Hugging Face model)
