Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue

Zhenhong Zhou; Jiuyang Xiang; Haopeng Chen; Quan Liu; Zherui Li; Sen Su

arXiv:2402.17262·cs.CL·October 31, 2024·1 cites

Open Access

TL;DR

This paper shows that the safety behavior of large language models can be circumvented in multi-turn dialogue: a malicious user can decompose a harmful query into sub-questions, spread them across turns, and induce unsafe responses incrementally.

Contribution

The paper demonstrates that existing safety mechanisms are insufficient in multi-turn dialogue, a vulnerability that prior single-turn safety studies did not address.

Findings

01 LLMs can be manipulated into generating harmful responses through multi-turn dialogue.

02 Current safety mechanisms fail to prevent incremental harmful responses in complex dialogues.

03 Vulnerabilities are consistent across various large language models.
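
The second finding can be made concrete with a toy sketch of why a filter that judges each turn in isolation may miss risk that only appears across the whole dialogue. This is not the paper's method: the per-turn risk scores, the 0.5 threshold, and the aggregation rule are all assumptions chosen for illustration, and the function names are hypothetical.

```python
from typing import List

# Illustrative cutoff only; not a value taken from the paper.
THRESHOLD = 0.5


def per_turn_flags(scores: List[float], threshold: float = THRESHOLD) -> List[bool]:
    """Flag each turn independently, as a single-turn filter would."""
    return [s >= threshold for s in scores]


def cumulative_risk(scores: List[float]) -> float:
    """Aggregate turn scores as 1 - prod(1 - s).

    This is one simple, assumed way to let moderate scores add up over a
    dialogue; it treats the scores as independent probabilities.
    """
    safe = 1.0
    for s in scores:
        safe *= 1.0 - s
    return 1.0 - safe


def dialogue_flag(scores: List[float], threshold: float = THRESHOLD) -> bool:
    """Flag the dialogue as a whole using the aggregated score."""
    return cumulative_risk(scores) >= threshold


# Hypothetical scores for four borderline turns, each under the cutoff.
turn_scores = [0.2, 0.3, 0.25, 0.3]
print(per_turn_flags(turn_scores))  # no single turn is flagged
print(dialogue_flag(turn_scores))   # the dialogue as a whole is flagged
```

With these made-up numbers, every turn stays below the cutoff while the aggregated score crosses it, which mirrors the gap between single-turn and dialogue-level checks that the findings describe.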

Abstract

Large Language Models (LLMs) have been demonstrated to generate illegal or unethical responses, particularly when subjected to "jailbreak." Research on jailbreak has highlighted the safety issues of LLMs. However, prior studies have predominantly focused on single-turn dialogue, ignoring the potential complexities and risks presented by multi-turn dialogue, a crucial mode through which humans derive information from LLMs. In this paper, we argue that humans could exploit multi-turn dialogue to induce LLMs into generating harmful information. LLMs may not intend to reject cautionary or borderline unsafe queries, even if each turn is closely served for one malicious purpose in a multi-turn dialogue. Therefore, by decomposing an unsafe query into several sub-queries for multi-turn dialogue, we induced LLMs to answer harmful sub-questions incrementally, culminating in an overall harmful…

Taxonomy

Topics: Topic Modeling · Interpreting and Communication in Healthcare