Robust Conversational Agents against Imperceptible Toxicity Triggers

Ninareh Mehrabi; Ahmad Beirami; Fred Morstatter; Aram Galstyan

arXiv:2205.02392 · cs.CL · May 6, 2022 · 1 citation

TL;DR

This paper introduces adversarial attacks on conversational agents that are imperceptible, meaning the attack utterances stay coherent, relevant, and fluent within the conversation. It also proposes a defense mechanism that prevents toxic language generation while maintaining conversation quality.

Contribution

It presents a novel scalable attack method that is imperceptible and effective, along with a defense mechanism that mitigates toxicity without disrupting conversational flow.

Findings

01. Defense reduces toxic language generation effectively.
02. Attacks are imperceptible and scalable.
03. Defense generalizes to other language models.

Abstract

Warning: this paper contains content that may be offensive or upsetting. Recent research in Natural Language Processing (NLP) has advanced the development of various toxicity detection models with the intention of identifying and mitigating toxic language from existing systems. Despite the abundance of research in this area, less attention has been given to adversarial attacks that force the system to generate toxic language and the defense against them. Existing work to generate such attacks is either based on human-generated attacks, which are costly and not scalable, or, in the case of automatic attacks, the attack vector does not conform to human-like language and can therefore be detected using a language model loss. In this work, we propose attacks against conversational agents that are imperceptible, i.e., they fit the conversation in terms of coherency, relevancy, and fluency, while they are…

Code & Models

Repositories

ninarehm/robust-agents
PyTorch · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection