Detoxification for LLM: From Dataset Itself

Wei Shao; Yihang Wang; Gaoyu Zhu; Ziqiang Cheng; Lei Yu; Jiafeng Guo; Xueqi Cheng

arXiv:2604.19124·cs.CL·April 22, 2026

TL;DR

This paper introduces HSPD, a dataset detoxification pipeline that rewrites toxic spans in raw training data while preserving semantics. It outperforms existing detoxification approaches across multiple LLMs; on GPT2-XL, Toxicity Probability drops from 0.42 to 0.18.

Contribution

The paper presents HSPD, a semantics-preserving detoxification pipeline that cleans the training data directly instead of intervening at post-training or inference time. The resulting corpus can replace the original for fine-tuning or other training.

Findings

01

HSPD reduces Toxicity Probability from 0.42 to 0.18 on GPT2-XL.

02

HSPD achieves state-of-the-art detoxification results on multiple LLMs.

03

The method maintains data utility while effectively suppressing toxicity.
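Finding 01 is reported as Toxicity Probability (TP), which this page does not define. A minimal sketch, assuming the convention common in toxicity benchmarks such as RealToxicityPrompts: TP is the fraction of prompts for which at least one sampled continuation scores at or above a toxicity threshold (0.5 here). The function name, the threshold default, and the input layout are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def toxicity_probability(
    scores_per_prompt: Sequence[Sequence[float]],
    threshold: float = 0.5,  # assumed cutoff; the paper may use a different one
) -> float:
    """Fraction of prompts with at least one continuation scored >= threshold.

    scores_per_prompt[i] holds the toxicity scores (in [0, 1]) of the
    continuations sampled for prompt i, as produced by some external classifier.
    """
    if not scores_per_prompt:
        raise ValueError("need at least one prompt")
    toxic_prompts = sum(
        1
        for scores in scores_per_prompt
        if any(score >= threshold for score in scores)
    )
    return toxic_prompts / len(scores_per_prompt)


# Two prompts, two continuations each: only the first prompt crosses 0.5.
example_scores = [[0.1, 0.7], [0.2, 0.3]]
tp = toxicity_probability(example_scores)
```

Under this reading, a drop from 0.42 to 0.18 would mean that the share of prompts eliciting at least one toxic continuation fell from 42% to 18%.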

Abstract

Existing detoxification methods for large language models mainly focus on the post-training stage or inference time, while few tackle the source of toxicity, namely the dataset itself. Such training-based or controllable-decoding approaches cannot completely suppress the model's inherent toxicity, whereas detoxifying the pretraining dataset can fundamentally reduce the toxicity that the model learns during training. Hence, we detoxify raw corpora directly with SoCD (Soft Contrastive Decoding), which guides an LLM to localize and rewrite toxic spans in raw data while preserving semantics. SoCD is part of our proposed HSPD (Hierarchical Semantic-Preserving Detoxification) pipeline, which yields a detoxified corpus that can serve as a drop-in replacement for the original in fine-tuning or other training. On GPT2-XL, HSPD attains state-of-the-art detoxification, reducing Toxicity Probability (TP) from 0.42 to 0.18…
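The abstract describes localizing toxic spans and rewriting only those spans so the surrounding text, and therefore most of the meaning, is left intact. The sketch below illustrates that localize-then-rewrite control flow only. It is not SoCD or HSPD: the toy word list, `localize_spans`, and `rewrite_span` are hypothetical stand-ins for the LLM-guided components the paper describes.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive

# Hypothetical toy lexicon; a real system would use a learned model instead.
TOY_REPLACEMENTS = {"idiot": "person", "stupid": "poor"}

_TOY_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in TOY_REPLACEMENTS) + r")\b",
    re.IGNORECASE,
)


def localize_spans(text: str) -> List[Span]:
    """Stand-in localizer: returns ordered, non-overlapping flagged spans."""
    return [match.span() for match in _TOY_PATTERN.finditer(text)]


def rewrite_span(span_text: str) -> str:
    """Stand-in rewriter: maps a flagged span to a milder replacement."""
    return TOY_REPLACEMENTS[span_text.lower()]


def detoxify(
    text: str,
    localize: Callable[[str], List[Span]] = localize_spans,
    rewrite: Callable[[str], str] = rewrite_span,
) -> str:
    """Rewrite only the flagged spans and copy everything else verbatim."""
    pieces = []
    cursor = 0
    for start, end in localize(text):
        pieces.append(text[cursor:start])         # untouched context
        pieces.append(rewrite(text[start:end]))   # rewritten span
        cursor = end
    pieces.append(text[cursor:])                  # trailing context
    return "".join(pieces)


cleaned = detoxify("That idiot made a stupid plan.")
```

Passing `localize` and `rewrite` as parameters keeps the span-level control flow separate from whichever model does the detection and rewriting, so the toy functions can be swapped for real components without changing `detoxify`.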


Code & Models

Repositories

ntsw2001/data_detox_for_llm (GitHub)
