TL;DR
This paper introduces HSPD, a dataset detoxification method that rewrites toxic spans in raw data while preserving their meaning, so that models trained on the cleaned corpus are less toxic. It outperforms existing approaches across multiple LLMs.
Contribution
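The paper presents HSPD (Hierarchical Semantic-Preserving Detoxification), a pipeline that cleans the training data itself rather than intervening after training or at inference time. It uses SoCD (Soft Contrastive Decoding) to guide an LLM in locating and rewriting toxic spans, which significantly reduces toxicity in large language models.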
Findings
HSPD reduces Toxicity Probability from 0.42 to 0.18 on GPT2-XL.
HSPD achieves state-of-the-art detoxification results on multiple LLMs.
The method maintains data utility while effectively suppressing toxicity.
Abstract
Existing detoxification methods for large language models mainly target the post-training stage or inference time, while few tackle the source of toxicity, namely the dataset itself. Such training-based or controllable-decoding approaches cannot completely suppress the model's inherent toxicity, whereas detoxifying the pretraining dataset can fundamentally reduce the toxicity that the model learns during training. Hence, we detoxify raw corpora directly with SoCD (Soft Contrastive Decoding), which guides an LLM to localize and rewrite toxic spans in raw data while preserving semantics, as part of our proposed HSPD (Hierarchical Semantic-Preserving Detoxification) pipeline. The result is a detoxified corpus that can serve as a drop-in replacement for the original in fine-tuning or other training. On GPT2-XL, HSPD attains state-of-the-art detoxification, reducing Toxicity Probability (TP) from 0.42 to 0.18…
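The text above does not define Toxicity Probability. As a rough illustration only, the sketch below assumes the definition commonly used in prompt-based toxicity benchmarks: the fraction of prompts for which at least one sampled continuation receives a classifier toxicity score at or above a threshold (often 0.5). The function name, the threshold default, and the input layout are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def toxicity_probability(
    scores_per_prompt: Sequence[Sequence[float]],
    threshold: float = 0.5,  # assumed cutoff, not confirmed by the paper
) -> float:
    """Fraction of prompts with at least one continuation scored >= threshold.

    scores_per_prompt[i] holds the toxicity scores (in [0, 1]) that some
    external classifier assigned to the continuations sampled for prompt i.
    """
    if not scores_per_prompt:
        raise ValueError("need at least one prompt")
    # Count prompts where any sampled continuation crosses the threshold.
    toxic_prompts = sum(
        1
        for scores in scores_per_prompt
        if any(score >= threshold for score in scores)
    )
    return toxic_prompts / len(scores_per_prompt)


# Four prompts with two continuations each; two prompts have a toxic sample.
example_scores = [[0.1, 0.7], [0.2, 0.3], [0.05, 0.1], [0.9, 0.1]]
tp = toxicity_probability(example_scores)
```

Under this assumed definition, a drop from 0.42 to 0.18 would mean the share of prompts that elicit at least one toxic continuation falls from 42% to 18%.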
