TaeBench: Improving Quality of Toxic Adversarial Examples
Xuan Zhu, Dmitriy Bespalov, Liwen You, Ninad Kulkarni, Yanjun Qi

TL;DR
This paper introduces TaeBench, a high-quality dataset of toxic adversarial examples created through a novel annotation pipeline, which improves the evaluation and robustness of toxicity detection models.
Contribution
The paper presents a new annotation pipeline for quality control of toxic adversarial examples and curates TaeBench, a large dataset that enhances attack transferability and model robustness.
Findings
Adversarial examples from TaeBench transfer effectively to toxicity detection models.
Adversarial training with TaeBench improves model robustness.
Many existing attack samples are invalid, highlighting the need for quality control.
Abstract
Toxicity text detectors can be vulnerable to adversarial examples - small perturbations to input text that fool the systems into wrong detection. Existing attack algorithms are time-consuming and often produce invalid or ambiguous adversarial examples, making them less useful for evaluating or improving real-world toxicity content moderators. This paper proposes an annotation pipeline for quality control of generated toxic adversarial examples (TAE). We design model-based automated annotation and human-based quality verification to assess the quality requirements of TAE. Successful TAE should fool a target toxicity model into making benign predictions, be grammatically reasonable, appear natural like human-generated text, and exhibit semantic toxicity. When applying these requirements to more than 20 state-of-the-art (SOTA) TAE attack recipes, we find many invalid samples from a total…
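The four requirements listed in the abstract can be read as a conjunction of checks applied to each candidate example. The following is a minimal sketch of that reading, not the authors' pipeline: the names `Checks`, `failed_requirements`, `filter_valid`, and the toy check functions are all hypothetical, and the toy checks are crude stand-ins for the model-based annotators and human verification the abstract describes.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Order in which the requirements are reported (names are illustrative).
REQUIREMENTS = ["fools_target", "is_grammatical", "looks_natural", "is_toxic"]


@dataclass(frozen=True)
class Checks:
    """Hypothetical pluggable checks; each returns True when the requirement holds."""
    fools_target: Callable[[str], bool]    # target model predicts "benign"
    is_grammatical: Callable[[str], bool]  # text is grammatically reasonable
    looks_natural: Callable[[str], bool]   # text reads like human-written text
    is_toxic: Callable[[str], bool]        # text is still semantically toxic


def failed_requirements(text: str, checks: Checks) -> List[str]:
    """Return the names of the requirements that `text` does not satisfy."""
    return [name for name in REQUIREMENTS if not getattr(checks, name)(text)]


def filter_valid(candidates: Iterable[str], checks: Checks) -> List[str]:
    """Keep only candidates that satisfy all four requirements."""
    return [t for t in candidates if not failed_requirements(t, checks)]


# Toy stand-ins (assumptions for illustration only).
HUMAN_LABELED_TOXIC = {"You are an idiot.", "You are an idoit."}

TOY_CHECKS = Checks(
    # A keyword detector is "fooled" whenever the keyword is absent.
    fools_target=lambda t: "idiot" not in t.lower(),
    is_grammatical=lambda t: t[:1].isupper() and t.endswith("."),
    looks_natural=lambda t: not any(c.isdigit() for c in t),
    is_toxic=lambda t: t in HUMAN_LABELED_TOXIC,
)

CANDIDATES = [
    "You are an idoit.",   # perturbed, evades the toy detector, still toxic
    "You are an idiot.",   # unperturbed, caught by the toy detector
    "Have a nice day.",    # evades the detector only because it is benign
    "y0u are an 1d10t",    # evades the detector but is neither grammatical nor natural
]

if __name__ == "__main__":
    for text in CANDIDATES:
        print(repr(text), "->", failed_requirements(text, TOY_CHECKS) or "valid")
```

In this sketch a candidate counts as valid only when `failed_requirements` returns an empty list, which mirrors the abstract's point that an example that merely flips the target model's prediction is not necessarily a useful adversarial example.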
Taxonomy
Topics: Advanced Malware Detection Techniques · Adversarial Robustness in Machine Learning · Software Testing and Debugging Techniques
