ToxiCloakCN: Evaluating Robustness of Offensive Language Detection in Chinese with Cloaking Perturbations

Yunze Xiao; Yujia Hu; Kenny Tsu Wei Choo; Roy Ka-wei Lee

arXiv:2406.12223·cs.CL·June 19, 2024

Open Access

TL;DR

This paper evaluates the robustness of Chinese offensive language detection models against cloaking perturbations, namely homophonic substitutions and emoji transformations. It reports significant performance drops under these perturbations and argues for more resilient detection methods.

Contribution

Introduces ToxiCloakCN, an augmented dataset with perturbations to test LLM robustness in Chinese offensive language detection, highlighting current model vulnerabilities.

Findings

01. Models significantly underperform with perturbations
02. Different offensive content types are variably affected
03. Human and model explanations of offensiveness show alignment

Abstract

Detecting hate speech and offensive language is essential for maintaining a safe and respectful digital environment. This study examines the limitations of state-of-the-art large language models (LLMs) in identifying offensive content within systematically perturbed data, with a focus on Chinese, a language particularly susceptible to such perturbations. We introduce ToxiCloakCN, an enhanced dataset derived from ToxiCN, augmented with homophonic substitutions and emoji transformations, to test the robustness of LLMs against these cloaking perturbations. Our findings reveal that existing models significantly underperform in detecting offensive content when these perturbations are applied. We provide an in-depth analysis of how different types of offensive content are affected by these perturbations and explore the alignment between human and model explanations of offensiveness.…
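To make the two perturbation types named in the abstract concrete, here is a minimal Python sketch of character-level cloaking. It is an illustration only: the function name `cloak` and both mapping tables are hypothetical toy examples, not the lexicons or procedure used to build ToxiCloakCN.

```python
# Toy mappings invented for illustration; NOT the lexicons used for ToxiCloakCN.
# Homophonic substitution: swap a character for one with the same or similar sound.
HOMOPHONES = {"你": "泥", "是": "柿"}
# Emoji transformation: swap a character for an emoji that depicts it.
EMOJI = {"猪": "🐷", "狗": "🐶"}


def cloak(text: str, mapping: dict[str, str]) -> str:
    """Replace every character that has an entry in `mapping`; keep the rest."""
    return "".join(mapping.get(ch, ch) for ch in text)


sample = "你是猪"  # "you are a pig"
print(cloak(sample, HOMOPHONES))  # homophone-cloaked variant
print(cloak(sample, EMOJI))       # emoji-cloaked variant
```

A human reader can usually still recover the meaning of the cloaked strings, while a detector that relies on the original surface characters may not. That gap is what a robustness benchmark of this kind is meant to measure.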

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Interpreting and Communication in Healthcare · Natural Language Processing Techniques