Cleansing the Artificial Mind: A Self-Reflective Detoxification Framework for Large Language Models
Kaituo Zhang, Zhimeng Jiang, Na Zou

TL;DR
This paper presents a fully self-reflective detoxification framework for large language models that leverages their inherent abilities to detect and correct toxic content without external modules or human intervention, improving safety and coherence.
Contribution
It introduces a novel self-reflective detoxification method utilizing internal mechanisms of LLMs, eliminating reliance on external tools or data annotation for safer text generation.
Findings
Outperforms state-of-the-art detoxification methods on benchmark datasets.
Preserves semantic fidelity while reducing toxicity effectively.
Demonstrates intrinsic self-detoxification capabilities of LLMs.
Abstract
Recent breakthroughs in Large Language Models (LLMs) have revealed remarkable generative capabilities and emerging self-regulatory mechanisms, including self-correction and self-rewarding. However, current detoxification techniques rarely exploit these built-in abilities; instead, they rely on external modules, labor-intensive data annotation, or human intervention, all of which hinder scalability and consistency. In this paper, we introduce a fully self-reflective detoxification framework that harnesses the inherent capacities of LLMs to detect and correct toxic content and to refine the LLMs themselves, without external modules or data annotation. Specifically, we propose a Toxic Signal Detector, an internal self-identification mechanism, coupled with a systematic intervention process that transforms toxic text into its non-toxic counterpart. This iterative procedure yields a contrastive detoxification…
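The detect-then-intervene loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract is truncated and does not specify how the Toxic Signal Detector or the intervention step work internally. The function names (`self_detoxify`, `contrastive_pair`), the `max_rounds` parameter, and the toy word-list detector and rewriter are all hypothetical stand-ins; in the paper's setting both roles would presumably be played by the LLM itself.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
#   detect(text)         -> list of flagged spans (empty if the text looks clean)
#   rewrite(text, spans) -> text with the flagged spans replaced
Detector = Callable[[str], List[str]]
Rewriter = Callable[[str, List[str]], str]


def self_detoxify(text: str, detect: Detector, rewrite: Rewriter,
                  max_rounds: int = 3) -> Tuple[str, int]:
    """Alternate detection and rewriting until nothing is flagged.

    Returns the final text and the number of rewrite rounds applied.
    """
    current = text
    for round_idx in range(max_rounds):
        spans = detect(current)
        if not spans:
            return current, round_idx
        current = rewrite(current, spans)
    return current, max_rounds


def contrastive_pair(text: str, detect: Detector,
                     rewrite: Rewriter) -> Optional[Tuple[str, str]]:
    """Return (original, detoxified) if any rewriting happened, else None."""
    cleaned, rounds = self_detoxify(text, detect, rewrite)
    return (text, cleaned) if rounds > 0 else None


# Toy stand-ins so the sketch runs without a model (illustration only).
TOY_LEXICON = {"stupid": "mistaken", "idiot": "person"}


def toy_detect(text: str) -> List[str]:
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w in TOY_LEXICON]


def toy_rewrite(text: str, spans: List[str]) -> str:
    out = []
    for token in text.split():
        core = token.strip(".,!?")
        if core.lower() in spans:
            token = token.replace(core, TOY_LEXICON[core.lower()])
        out.append(token)
    return " ".join(out)


if __name__ == "__main__":
    print(contrastive_pair("You are a stupid idiot!", toy_detect, toy_rewrite))
    print(contrastive_pair("Have a nice day.", toy_detect, toy_rewrite))
```

Passing the detector and rewriter as callables keeps the loop independent of any particular model; swapping the toy functions for prompts to the same LLM would be one way to approximate the "no external modules" setting, though the paper's actual mechanism may differ.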
Taxonomy
Topics: Topic Modeling · Hate Speech and Cyberbullying Detection · Computational and Text Analysis Methods
