COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability

Xingang Guo; Fangxu Yu; Huan Zhang; Lianhui Qin; Bin Hu

arXiv:2402.08679 · cs.LG · June 10, 2024 · 5 cites

TL;DR

This paper introduces COLD-Attack, a novel framework for controllable jailbreak attacks on large language models, enabling diverse, stealthy, and high-success-rate adversarial attacks through an energy-based decoding approach.

Contribution

The paper formulates the controllable attack generation problem for LLMs and adapts energy-based constrained decoding to unify and automate diverse attack scenarios under controllability constraints.

Findings

01 COLD-Attack achieves high success rates across multiple LLMs.

02 The framework enables diverse attack scenarios, including query revision and stealthy insertion.

03 Experiments demonstrate broad applicability and transferability of the attacks.

Abstract

Jailbreaks on large language models (LLMs) have recently received increasing attention. For a comprehensive assessment of LLM safety, it is essential to consider jailbreaks with diverse attributes, such as contextual coherence and sentiment/stylistic variations, and hence it is beneficial to study controllable jailbreaking, i.e. how to enforce control on LLM attacks. In this paper, we formally formulate the controllable attack generation problem, and build a novel connection between this problem and controllable text generation, a well-explored topic of natural language processing. Based on this connection, we adapt the Energy-based Constrained Decoding with Langevin Dynamics (COLD), a state-of-the-art, highly efficient algorithm in controllable text generation, and introduce the COLD-Attack framework which unifies and automates the search of adversarial LLM attacks under a variety of…
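The abstract names Langevin dynamics over an energy function as the sampling mechanism that COLD builds on. As a rough, generic illustration of that mechanism only, the sketch below runs noisy gradient descent on a toy quadratic energy over a small continuous vector. Everything in it is an assumption made for illustration: the energy, the function names `energy_grad` and `langevin_sample`, the step size, and the linear noise schedule are hypothetical and are not taken from the paper or its repository, and no language model or constraint from the paper is involved.

```python
import random


def energy_grad(y, target):
    # Gradient of the toy energy E(y) = 0.5 * sum((y_i - t_i)^2).
    # This quadratic energy is an illustrative assumption.
    return [yi - ti for yi, ti in zip(y, target)]


def langevin_sample(target, steps=500, step_size=0.1, noise_start=1.0, seed=0):
    """Noisy gradient descent on the toy energy with annealed noise.

    Update rule: y <- y - step_size * grad E(y) + noise,
    where noise ~ N(0, sigma_n^2) and sigma_n shrinks linearly with n.
    """
    rng = random.Random(seed)
    # Start from a random point, as a sampler normally would.
    y = [rng.gauss(0.0, 1.0) for _ in target]
    for n in range(steps):
        # Linearly annealed noise scale (hypothetical schedule).
        sigma = noise_start * (1.0 - n / steps)
        grad = energy_grad(y, target)
        y = [yi - step_size * gi + rng.gauss(0.0, sigma)
             for yi, gi in zip(y, grad)]
    return y


if __name__ == "__main__":
    low_energy_point = [1.0, -2.0, 0.5]
    print(langevin_sample(low_energy_point))
```

Early iterations are dominated by noise and explore widely; as the noise scale shrinks, the iterate settles near the minimum of the energy. Lower energy here simply means closer to `target`.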

Code & Models

Repositories

yu-fangxu/cold-attack (PyTorch, official)

Models

CTCT-CT2/changeway_guardrails (Hugging Face model · 10 downloads · 2 likes)

Taxonomy

Topics: Advanced Malware Detection Techniques · Digital and Cyber Forensics · Cybercrime and Law Enforcement Studies

Methods: Linear Layer · Cosine Annealing · Byte Pair Encoding · Multi-Head Attention · Layer Normalization · Residual Connection · Adam