On the Robustness of Knowledge Editing for Detoxification
Ming Dong, Shiyi Tang, Ziyan Peng, Guanyi Chen, Tingting He

TL;DR
This paper critically evaluates the robustness of knowledge-editing methods for detoxifying large language models, and finds that their reliability is limited across optimisation objectives, jointly edited behaviours, and languages.
Contribution
It introduces a robustness-focused evaluation framework for KE-based detoxification, highlighting common failure modes and limitations in current approaches.
Findings
Pseudo-detoxification is a common failure mode.
Detoxification effectiveness decreases with multiple unsafe edits.
Effectiveness varies across models, objectives, and languages.
Abstract
Knowledge-Editing-based (KE-based) detoxification has emerged as a promising approach for mitigating harmful behaviours in Large Language Models. Existing evaluations, however, largely rely on automatic toxicity classifiers, implicitly assuming that reduced toxicity scores reflect genuine behavioural suppression. In this work, we propose a robustness-oriented evaluation framework for KE-based detoxification that examines its reliability beyond standard classifier-based metrics along three dimensions: optimisation robustness, compositional robustness, and cross-lingual robustness. We identify pseudo-detoxification as a common failure mode, where apparent toxicity reductions arise from degenerate generation behaviours rather than meaningful suppression of unsafe content. We further show that detoxification effectiveness degrades when multiple unsafe behaviours are edited jointly, and that…
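The pseudo-detoxification failure mode described in the abstract can be illustrated with a small check that looks at the generated text as well as the classifier score. The following is a minimal sketch and not the paper's evaluation framework: the function names, the bigram-diversity heuristic, and every threshold are hypothetical choices made for illustration, and the toxicity score is assumed to come from whatever external classifier is already in use.

```python
def distinct_ngram_ratio(text: str, n: int = 2) -> float:
    """Fraction of n-grams that are unique; 0.0 if the text has fewer than n tokens."""
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)


def is_degenerate(text: str, min_tokens: int = 3, min_distinct: float = 0.5) -> bool:
    """Heuristic (assumed, not from the paper): flag empty, very short,
    or highly repetitive generations as degenerate."""
    if len(text.split()) < min_tokens:
        return True
    return distinct_ngram_ratio(text) < min_distinct


def classify_outcome(toxicity_after: float, generation: str,
                     tox_threshold: float = 0.5) -> str:
    """Label one post-edit generation.

    toxicity_after is a score in [0, 1] from an external toxicity
    classifier (assumed); tox_threshold is an illustrative cut-off.
    """
    if toxicity_after >= tox_threshold:
        return "still_toxic"
    # A low score alone is not enough: the text may be low-toxicity
    # only because the model stopped producing meaningful output.
    if is_degenerate(generation):
        return "pseudo_detoxified"
    return "detoxified"


if __name__ == "__main__":
    print(classify_outcome(0.1, "no no no no no no no no"))
    print(classify_outcome(0.1, "I cannot help with that, but here is a safer alternative."))
```

The point of the sketch is only that a low classifier score and a fluent, non-degenerate response are separate conditions, so an evaluation that checks the first without the second can overstate how well an edit worked.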
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · Artificial Intelligence in Healthcare and Education
