Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting

Jingyi Kang; Junyu Lu; Bo Xu; Hongbo Wang; Linlin Zong; Roy Ka-Wei Lee; Hongfei Lin

arXiv:2605.22258 · cs.CL · May 22, 2026

TL;DR

This paper introduces CITA, a framework for evaluating and improving Chinese toxicity detection by generating implicit and obfuscated toxic language samples, revealing detection vulnerabilities and enhancing model robustness.

Contribution

The paper presents CITA, a novel controlled red-team framework for Chinese toxicity evaluation that emphasizes implicitness and obfuscation, and demonstrates its effectiveness in revealing detector weaknesses and training defenses.

Findings

01

The seven tested detectors showed substantial missed-detection risk on CITA-generated toxic samples, with an average ASR of 69.48%.

02

Human evaluation confirmed that the generated samples preserve harmfulness while increasing implicitness and evasiveness.

03

Fine-tuning with CITA data improved robustness of toxicity detectors.

Abstract

Large language models (LLMs) require robust toxicity evaluation beyond explicit wording. This setting remains underexplored in Chinese, where toxicity may combine semantic indirectness with surface obfuscation. We introduce Chinese Implicit Toxicity Attack (CITA), a controlled red-team evaluation and defense-data generation framework, not a deployable evasion tool. CITA uses three stages: (i) Harmful Intent Learning, (ii) Implicit Toxicity Enhancement, and (iii) Obfuscation Variant Rewriting, to preserve harmful intent, increase implicitness, and add controlled surface variants. On CITA-generated evaluation samples, the seven tested detectors exhibit substantial missed-detection risks, reaching an average ASR of 69.48%; human evaluation further confirms preserved harmfulness and increased implicitness/evasiveness. As a downstream defense application, we fine-tune a Chinese Implicit…
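The average ASR figure quoted in the abstract can be illustrated with a minimal sketch. It assumes ASR stands for attack success rate and is measured as the fraction of toxic samples that a detector labels non-toxic, then averaged with equal weight across detectors; the paper's exact protocol may differ. The function names, detector names, and toy predictions below are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def attack_success_rate(predictions: List[int]) -> float:
    """Fraction of toxic samples a detector misses.

    Assumption: every input sample is toxic, and a prediction of 0
    means the detector labeled it non-toxic (a missed detection).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    missed = sum(1 for p in predictions if p == 0)
    return missed / len(predictions)


def average_asr(per_detector: Dict[str, List[int]]) -> float:
    """Unweighted mean of per-detector ASR values (an assumed aggregation)."""
    if not per_detector:
        raise ValueError("need at least one detector")
    rates = [attack_success_rate(p) for p in per_detector.values()]
    return sum(rates) / len(rates)


# Hypothetical toy predictions for two made-up detectors on four toxic samples.
toy_predictions = {
    "detector_a": [0, 0, 1, 0],  # misses 3 of 4
    "detector_b": [1, 0, 1, 0],  # misses 2 of 4
}

if __name__ == "__main__":
    for name, preds in toy_predictions.items():
        print(f"{name}: ASR = {attack_success_rate(preds):.2%}")
    print(f"average ASR = {average_asr(toy_predictions):.2%}")
```

Under this reading, a higher ASR means more toxic samples slip past the detector, so an average of 69.48% would correspond to roughly seven in ten generated samples going undetected.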
