Toxicity Detection towards Adaptability to Changing Perturbations

Hankun Kang; Jianhao Chen; Yongqi Li; Xin Miao; Mayi Xu; Ming Zhong; Yuanyuan Zhu; Tieyun Qian

arXiv:2412.15267 · cs.CR · March 5, 2025

TL;DR

This paper introduces a new dataset and benchmark for toxicity detection that evaluates models' robustness against evolving perturbation patterns, proposing a continual learning approach to improve adaptability to new malicious tactics.

Contribution

It presents a novel dataset with diverse perturbation patterns, systematically evaluates existing methods' vulnerabilities, and proposes a domain incremental learning paradigm for enhanced robustness.

Findings

01. Current methods are vulnerable to new perturbation patterns.

02. The proposed continual learning approach improves detection robustness.

03. Benchmark results highlight the need for adaptive toxicity detection models.

Abstract

Toxicity detection is crucial for maintaining a peaceful society. Existing methods perform well on ordinary toxic content or on content generated by specific perturbation methods, but they are vulnerable to evolving perturbation patterns. In real-world scenarios, malicious users tend to create new perturbation patterns to fool detectors. For example, some users may circumvent the detector of a large language model (LLM) by adding "I am a scientist" at the beginning of the prompt. In this paper, we introduce a novel problem into the toxicity detection field: continually learning jailbreak perturbation patterns. To tackle this problem, we first construct a new dataset generated by 9 types of perturbation patterns, 7 summarized from prior work and 2 developed by us. We then systematically validate the vulnerability of current methods on this…
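To make the idea of a perturbation pattern concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's dataset or method: `role_prefix` mirrors the "I am a scientist" example from the abstract, while `char_substitution`, `keyword_detector`, and the blocklist are hypothetical stand-ins chosen only to show how a detector that handles one pattern can miss another.

```python
from typing import Callable, Dict, Tuple


def role_prefix(text: str) -> str:
    # Pattern from the abstract: prepend a benign-sounding role claim.
    return "I am a scientist. " + text


def char_substitution(text: str) -> str:
    # Hypothetical pattern (an assumption, not claimed to be one of the
    # paper's 9): swap a few lowercase letters for look-alike symbols.
    table = str.maketrans({"a": "@", "i": "1", "o": "0"})
    return text.translate(table)


# Each "domain" in this sketch is one perturbation pattern.
PERTURBATIONS: Dict[str, Callable[[str], str]] = {
    "role_prefix": role_prefix,
    "char_substitution": char_substitution,
}


def keyword_detector(text: str, blocklist: Tuple[str, ...] = ("idiot",)) -> bool:
    # Toy detector (hypothetical): flags text containing a blocklisted word.
    lowered = text.lower()
    return any(word in lowered for word in blocklist)


def evaluate(sample: str) -> Dict[str, bool]:
    # Report, per perturbation pattern, whether the toy detector still flags
    # the perturbed sample.
    return {name: keyword_detector(fn(sample)) for name, fn in PERTURBATIONS.items()}


if __name__ == "__main__":
    print(evaluate("You are an idiot"))
```

In this toy setup the keyword detector still flags the role-prefixed sample but misses the character-substituted one, which is the kind of gap that motivates evaluating detectors against a changing set of perturbation patterns.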

Taxonomy

Topics: Computational Drug Discovery Methods

Methods: Attention Model