On the Robustness of Knowledge Editing for Detoxification

Ming Dong; Shiyi Tang; Ziyan Peng; Guanyi Chen; Tingting He

arXiv:2602.10504·cs.CL·February 12, 2026

Open Access

TL;DR

This paper evaluates how robust knowledge-editing methods are for detoxifying large language models and finds that their reliability degrades across optimisation objectives, joint edits of multiple unsafe behaviours, and languages.

Contribution

The paper introduces a robustness-oriented evaluation framework for KE-based detoxification, covering optimisation, compositional, and cross-lingual robustness, and uses it to expose common failure modes and limitations of current approaches.

Findings

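01. Pseudo-detoxification is a common failure mode: toxicity scores drop because generation becomes degenerate, not because unsafe content is meaningfully suppressed.

02. Detoxification effectiveness degrades when multiple unsafe behaviours are edited jointly.

03. Detoxification effectiveness varies across models, objectives, and languages.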

Abstract

Knowledge-Editing-based (KE-based) detoxification has emerged as a promising approach for mitigating harmful behaviours in Large Language Models. Existing evaluations, however, largely rely on automatic toxicity classifiers, implicitly assuming that reduced toxicity scores reflect genuine behavioural suppression. In this work, we propose a robustness-oriented evaluation framework for KE-based detoxification that examines its reliability beyond standard classifier-based metrics along three dimensions: optimisation robustness, compositional robustness, and cross-lingual robustness. We identify pseudo-detoxification as a common failure mode, where apparent toxicity reductions arise from degenerate generation behaviours rather than meaningful suppression of unsafe content. We further show that detoxification effectiveness degrades when multiple unsafe behaviours are edited jointly, and that…
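To make the idea of pseudo-detoxification concrete, the following is a minimal sketch of how a low classifier score could be cross-checked against a simple degeneracy heuristic. It is not the paper's evaluation framework: the function names, the distinct-bigram heuristic, and all thresholds are assumptions chosen for illustration, and the toxicity score is assumed to come from some external classifier returning a value in [0, 1].

```python
def distinct_n(tokens, n=2):
    """Fraction of n-grams that are unique (1.0 means no repetition)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


def looks_degenerate(text, min_tokens=5, min_distinct=0.5):
    """Heuristic (assumed, not from the paper): flag very short or highly
    repetitive generations as degenerate."""
    tokens = text.split()
    if len(tokens) < min_tokens:
        return True
    return distinct_n(tokens, 2) < min_distinct


def is_pseudo_detoxified(text, toxicity_score, tox_threshold=0.5):
    """A generation counts as pseudo-detoxified here when the classifier
    score is low but the text itself is degenerate."""
    return toxicity_score < tox_threshold and looks_degenerate(text)


# Hypothetical generations with hypothetical classifier scores.
repetitive = "the the the the the the the the"
fluent = "I would rather not help with that request, sorry."
print(is_pseudo_detoxified(repetitive, toxicity_score=0.1))
print(is_pseudo_detoxified(fluent, toxicity_score=0.1))
```

A check of this kind only separates "low score and fluent" from "low score and degenerate"; whether a fluent output genuinely suppresses the unsafe behaviour still needs separate evaluation.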

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · Artificial Intelligence in Healthcare and Education