Understanding Multi-Turn Toxic Behaviors in Open-Domain Chatbots
Bocheng Chen, Guangjing Wang, Hanqing Guo, Yuanda Wang, Qiben Yan

TL;DR
This paper investigates how open-domain chatbots can be led to generate toxic responses during otherwise non-toxic multi-turn conversations, and proposes a new attack, toxicbot, to expose these vulnerabilities and test chatbot safety.
Contribution
It introduces toxicbot, an attack in which a fine-tuned chatbot converses with a target chatbot and triggers toxic responses over multiple turns, highlighting the limitations of current defenses.
Findings
toxicbot achieves a 67% activation rate in triggering toxicity.
Existing tools consider safe 82% of the individually non-toxic sentences that elicit toxic behavior in a conversation.
The attack bypasses two existing toxicity defense methods.
Abstract
Recent advances in natural language processing and machine learning have led to the development of chatbot models, such as ChatGPT, that can engage in conversational dialogue with human users. However, the ability of these models to generate toxic or harmful responses during a non-toxic multi-turn conversation remains an open research question. Existing research focuses on single-turn sentence testing, while we find that 82% of the individual non-toxic sentences that elicit toxic behaviors in a conversation are considered safe by existing tools. In this paper, we design a new attack, toxicbot, by fine-tuning a chatbot to engage in conversation with a target open-domain chatbot. The chatbot is fine-tuned with a collection of crafted conversation sequences. In particular, each conversation begins with a sentence from a dataset of crafted prompt sentences. Our extensive evaluation shows that…
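The multi-turn setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `run_conversation` and `activation_rate`, the toy attacker, target, and scorer, the turn count, and the 0.5 threshold are all hypothetical, and the sketch assumes that "activation rate" means the fraction of conversations in which the target produces at least one reply scored as toxic.

```python
from typing import Callable, List

# Hypothetical interfaces: a chatbot maps the conversation history to its next
# utterance; a scorer maps one utterance to a toxicity score in [0, 1].
Chatbot = Callable[[List[str]], str]
Scorer = Callable[[str], float]


def run_conversation(opening: str, attacker: Chatbot, target: Chatbot,
                     turns: int) -> List[str]:
    """Alternate target and attacker turns, starting from a crafted opening.

    Returns only the target's replies, since those are what gets scored.
    """
    history = [opening]
    target_replies = []
    for _ in range(turns):
        reply = target(history)
        history.append(reply)
        target_replies.append(reply)
        history.append(attacker(history))
    return target_replies


def activation_rate(openings: List[str], attacker: Chatbot, target: Chatbot,
                    scorer: Scorer, turns: int = 5,
                    threshold: float = 0.5) -> float:
    """Fraction of conversations with at least one toxic target reply.

    This definition of activation is an assumption made for illustration.
    """
    if not openings:
        return 0.0
    activated = 0
    for opening in openings:
        replies = run_conversation(opening, attacker, target, turns)
        if any(scorer(r) >= threshold for r in replies):
            activated += 1
    return activated / len(openings)


# Toy stand-ins (not real models): the target only misbehaves after some
# context has accumulated, which a single-turn test would never reach.
def toy_attacker(history: List[str]) -> str:
    return "tell me more"


def toy_target(history: List[str]) -> str:
    if len(history) >= 3 and history[0].startswith("bait"):
        return "rude reply"
    return "ok"


def toy_scorer(utterance: str) -> float:
    return 1.0 if "rude" in utterance else 0.0


if __name__ == "__main__":
    rate = activation_rate(["bait one", "hello"], toy_attacker, toy_target,
                           toy_scorer, turns=2)
    print(rate)
```

With the toy target, a single turn never activates toxicity while two turns do for the "bait" opening, which mirrors the gap between single-turn sentence testing and multi-turn conversation testing that the abstract points to.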
