COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
Xingang Guo, Fangxu Yu, Huan Zhang, Lianhui Qin, Bin Hu

TL;DR
This paper introduces COLD-Attack, a novel framework for controllable jailbreak attacks on large language models, enabling diverse, stealthy, and high-success-rate adversarial attacks through an energy-based decoding approach.
Contribution
The paper formulates the controllable attack generation problem for LLMs and adapts energy-based constrained decoding to unify and automate the search for attacks under controllability constraints.
Findings
COLD-Attack achieves high success rates across multiple LLMs.
The framework enables diverse attack scenarios including query revision and stealthy insertion.
Experiments demonstrate broad applicability and transferability of the attacks.
Abstract
Jailbreaks on large language models (LLMs) have recently received increasing attention. For a comprehensive assessment of LLM safety, it is essential to consider jailbreaks with diverse attributes, such as contextual coherence and sentiment/stylistic variations, and hence it is beneficial to study controllable jailbreaking, i.e., how to enforce control on LLM attacks. In this paper, we formally formulate the controllable attack generation problem, and build a novel connection between this problem and controllable text generation, a well-explored topic of natural language processing. Based on this connection, we adapt the Energy-based Constrained Decoding with Langevin Dynamics (COLD), a state-of-the-art, highly efficient algorithm in controllable text generation, and introduce the COLD-Attack framework which unifies and automates the search for adversarial LLM attacks under a variety of…
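As a rough illustration of the "energy-based decoding with Langevin dynamics" idea named in the abstract, the sketch below runs a generic Langevin update (a gradient step on an energy plus annealed Gaussian noise) over a small continuous vector. Everything in it is an assumption made for illustration: the quadratic energy, the `targets` and `weights` names, the dimensions, and the hyperparameters are hypothetical and are not taken from the paper or its code. It shows only how several weighted constraints can be combined into one energy and sampled.

```python
import random


def energy(y, targets, weights):
    """Toy energy: a weighted sum of quadratic penalties, one per constraint.

    Each constraint pulls y toward its own target vector (hypothetical stand-in
    for the constraint energies a real system would define).
    """
    return sum(
        w * 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, t))
        for w, t in zip(weights, targets)
    )


def energy_grad(y, targets, weights):
    """Analytic gradient of the toy energy with respect to y."""
    return [
        sum(w * (yi - t[i]) for w, t in zip(weights, targets))
        for i, yi in enumerate(y)
    ]


def langevin_sample(targets, weights, steps=500, step_size=0.1,
                    noise_scale=1.0, seed=0):
    """Generic Langevin dynamics: y <- y - step_size * grad E(y) + noise.

    The noise standard deviation is annealed linearly toward zero so the
    final iterate settles near a low-energy point.
    """
    rng = random.Random(seed)
    dim = len(targets[0])
    y = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random initialization
    for n in range(steps):
        sigma = noise_scale * (1.0 - n / steps)  # annealed noise level
        grad = energy_grad(y, targets, weights)
        y = [yi - step_size * gi + rng.gauss(0.0, sigma)
             for yi, gi in zip(y, grad)]
    return y


if __name__ == "__main__":
    # Two equally weighted constraints with different targets; the energy
    # minimum is the weighted mean of the targets, here (0.5, 0.5).
    targets = [[1.0, 0.0], [0.0, 1.0]]
    weights = [1.0, 1.0]
    sample = langevin_sample(targets, weights)
    print(sample, energy(sample, targets, weights))
```

In this toy setting the constraints are simple enough that the minimizer is known in closed form, which makes the sampler easy to sanity-check; in a text-generation setting the vector would instead represent a soft token sequence and the energies would encode the desired attributes.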
Taxonomy
Topics: Advanced Malware Detection Techniques · Digital and Cyber Forensics · Cybercrime and Law Enforcement Studies
Methods: Linear Layer · Cosine Annealing · Byte Pair Encoding · Multi-Head Attention · Layer Normalization · Residual Connection · Adam
