Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation

Haoyu Wang; Bingzhe Wu; Yatao Bian; Yongzhe Chang; Xueqian Wang; Peilin Zhao

arXiv:2408.10668·cs.CR·August 27, 2024

TL;DR

This paper probes weaknesses in the safety mechanisms of large language models by introducing a decoding strategy called Jailbreak Value Decoding (JVD), and shows that a model can still be steered into generating harmful content despite safety alignment.

Contribution

The paper proposes a novel decoding approach using a cost value model to identify and exploit safety weaknesses in large language models.

Findings

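01. Under the proposed decoding, LLaMA-2-chat 7B outputs concrete toxic content in 39.18% of cases and refuses in only 22.16%, without any harmful suffixes

02. The proposed JVD method can induce unsafe outputs from a safety-aligned model

03. A model that appears to block harmful queries may still have hidden vulnerabilities, so safety alignment alone may not prevent harmful content generation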

Abstract

Large Language Models (LLMs) are implicit troublemakers. While they provide valuable insights and assist in problem-solving, they can also potentially serve as a resource for malicious activities. Implementing safety alignment could mitigate the risk of LLMs generating harmful responses. We argue that even when an LLM appears to successfully block harmful queries, there may still be hidden vulnerabilities that could act as ticking time bombs. To identify these underlying weaknesses, we propose to use a cost value model as both a detector and an attacker. Trained on external or self-generated harmful datasets, the cost value model could successfully influence the original safe LLM to output toxic content in the decoding process. For instance, LLaMA-2-chat 7B outputs 39.18% concrete toxic content, along with only 22.16% refusals, without any harmful suffixes. These potential weaknesses can…
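
The abstract does not say how the cost value model enters the decoding process. As a rough illustration only, the sketch below shows one common form of value-guided decoding: the base model's next-token log-probabilities are combined with a value score for each candidate continuation, and the highest combined score is selected. Everything here is an assumption made for illustration, not the authors' method: the names `value_guided_step`, `toy_cost_value`, `beta`, and `top_k` are hypothetical, the tokens are abstract placeholders, and the logits are made-up numbers rather than output from any real model.

```python
import math


def log_softmax(logits):
    """Convert a {token: logit} dict into {token: log-probability}."""
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return {tok: v - log_z for tok, v in logits.items()}


def value_guided_step(logits, cost_value, prefix, beta=1.0, top_k=3):
    """Pick the next token by log-prob plus a weighted value score.

    Assumption: the value model scores a whole candidate sequence
    (prefix + token). Only the top_k most likely tokens are rescored.
    """
    logp = log_softmax(logits)
    candidates = sorted(logp, key=logp.get, reverse=True)[:top_k]
    scores = {
        tok: logp[tok] + beta * cost_value(prefix + [tok])
        for tok in candidates
    }
    return max(scores, key=scores.get), scores


# Toy stand-ins (hypothetical): abstract tokens and made-up logits.
toy_logits = {"A": 2.0, "B": 1.0, "C": 0.5, "D": -1.0}


def toy_cost_value(tokens):
    # Hypothetical value model that strongly prefers sequences ending in "B".
    return 3.0 if tokens[-1] == "B" else 0.0


# beta = 0 ignores the value model, so this is plain greedy decoding.
plain_choice, _ = value_guided_step(toy_logits, toy_cost_value, [], beta=0.0)

# beta = 1 lets the value model override the base model's preference.
guided_choice, guided_scores = value_guided_step(
    toy_logits, toy_cost_value, [], beta=1.0
)
```

In this toy setup the base model alone prefers `A`, while the guided step selects `B` because the value bonus outweighs the gap in log-probability. The weight `beta` controls how far decoding can drift from the base model's own ranking, and `top_k` limits rescoring to tokens the base model already considers plausible.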

Taxonomy

Topics: Natural Language Processing Techniques