Cleansing the Artificial Mind: A Self-Reflective Detoxification Framework for Large Language Models

Kaituo Zhang; Zhimeng Jiang; Na Zou

arXiv:2601.11776·cs.CL·January 21, 2026

TL;DR

This paper presents a fully self-reflective detoxification framework for large language models that leverages their inherent abilities to detect and correct toxic content without external modules or human intervention, improving safety and coherence.

Contribution

The paper introduces a self-reflective detoxification method that uses the internal mechanisms of LLMs themselves, removing the reliance on external tools or data annotation for safer text generation.

Findings

01. Outperforms state-of-the-art detoxification methods on benchmark datasets.

02. Reduces toxicity effectively while preserving semantic fidelity.

03. Demonstrates intrinsic self-detoxification capabilities of LLMs.

Abstract

Recent breakthroughs in Large Language Models (LLMs) have revealed remarkable generative capabilities and emerging self-regulatory mechanisms, including self-correction and self-rewarding. However, current detoxification techniques rarely exploit these built-in abilities; instead, they rely on external modules, labor-intensive data annotation, or human intervention, factors that hinder scalability and consistency. In this paper, we introduce a fully self-reflective detoxification framework that harnesses the inherent capacities of LLMs to detect and correct toxic content and to refine the LLMs themselves, without external modules or data annotation. Specifically, we propose a Toxic Signal Detector, an internal self-identification mechanism, coupled with a systematic intervention process that transforms toxic text into its non-toxic counterpart. This iterative procedure yields a contrastive detoxification…

Taxonomy

Topics: Topic Modeling · Hate Speech and Cyberbullying Detection · Computational and Text Analysis Methods