ToxiCloakCN: Evaluating Robustness of Offensive Language Detection in Chinese with Cloaking Perturbations
Yunze Xiao, Yujia Hu, Kenny Tsu Wei Choo, Roy Ka-wei Lee

TL;DR
This paper evaluates the robustness of Chinese offensive language detection models against cloaking perturbations such as homophonic substitutions and emoji transformations. It finds significant performance drops and argues for more resilient detection methods.
Contribution
Introduces ToxiCloakCN, an augmented dataset with perturbations to test LLM robustness in Chinese offensive language detection, highlighting current model vulnerabilities.
Findings
Models significantly underperform with perturbations
Different offensive content types are variably affected
Human and model explanations of offensiveness show alignment
Abstract
Detecting hate speech and offensive language is essential for maintaining a safe and respectful digital environment. This study examines the limitations of state-of-the-art large language models (LLMs) in identifying offensive content within systematically perturbed data, with a focus on Chinese, a language particularly susceptible to such perturbations. We introduce ToxiCloakCN, an enhanced dataset derived from ToxiCN, augmented with homophonic substitutions and emoji transformations, to test the robustness of LLMs against these cloaking perturbations. Our findings reveal that existing models significantly underperform in detecting offensive content when these perturbations are applied. We provide an in-depth analysis of how different types of offensive content are affected by these perturbations and explore the alignment between human and model explanations of offensiveness.…
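To make the two perturbation types named in the abstract concrete, here is a minimal Python sketch of dictionary-based cloaking. It is an illustration only, not the authors' pipeline: the substitution tables, the `cloak` and `keyword_flag` helpers, and the neutral sample sentence are all assumptions made up for this example and are not taken from ToxiCloakCN.

```python
# Toy substitution tables: illustrative assumptions, NOT drawn from ToxiCloakCN.
HOMOPHONE_MAP = {"什么": "神马", "喜欢": "稀饭"}  # similar-sounding internet spellings
EMOJI_MAP = {"猪": "🐷", "狗": "🐶"}              # word replaced by a matching emoji


def cloak(text: str, table: dict) -> str:
    """Replace every occurrence of each key in `table` with its cloaked form."""
    # Process longer keys first so multi-character words take priority.
    for key in sorted(table, key=len, reverse=True):
        text = text.replace(key, table[key])
    return text


def keyword_flag(text: str, blocklist: set) -> bool:
    """Naive surface-form detector: True if any blocklisted term appears."""
    return any(term in text for term in blocklist)


original = "你喜欢什么狗"  # neutral sample sentence
homophone_version = cloak(original, HOMOPHONE_MAP)
emoji_version = cloak(original, EMOJI_MAP)

print(homophone_version)
print(emoji_version)
print(keyword_flag(original, {"狗"}), keyword_flag(emoji_version, {"狗"}))
```

In this toy setup the meaning stays readable to a human while the surface string changes, so a detector that matches on exact characters no longer fires on the emoji version. That is the general kind of gap the abstract describes, shown here with a deliberately simplistic detector rather than an LLM.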
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Interpreting and Communication in Healthcare · Natural Language Processing Techniques
