Short-length Adversarial Training Helps LLMs Defend Long-length Jailbreak Attacks: Theoretical and Empirical Evidence
Shaopeng Fu, Liang Ding, Jingfeng Zhang, Di Wang

TL;DR
This paper demonstrates that adversarial training with short-length prompts can effectively defend large language models against long-length jailbreak attacks, supported by theoretical analysis and empirical validation.
Contribution
It shows that aligning LLMs on short adversarial prompts suffices to defend against longer jailbreak attacks, which reduces the resource cost of adversarial training.
Findings
Adversarial training on short adversarial prompts is effective at defending against long jailbreak attacks.
Theoretical analysis shows that the adversarial suffix length needed during training scales with the square root of the attack suffix length.
Empirical results confirm the effectiveness of short-length adversarial training.
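The square-root relationship in the findings above can be made concrete with a small sketch. The function name `training_suffix_length` and the `constant` parameter are hypothetical: the summary states only an order-of-growth relationship, so the hidden constant factor is an assumption here (defaulting to 1), and the printed numbers are illustrative rather than values from the paper.

```python
import math


def training_suffix_length(attack_length: int, constant: float = 1.0) -> int:
    """Illustrative training suffix length on the order of sqrt(attack_length).

    `constant` is a hypothetical stand-in for the factor hidden by the
    asymptotic statement; its true value is not given in the summary.
    """
    if attack_length < 0:
        raise ValueError("attack_length must be non-negative")
    # Round up so the training suffix is never shorter than the sqrt target.
    return math.ceil(constant * math.sqrt(attack_length))


# Compare attack suffix lengths with the (assumed) training suffix lengths.
for attack_length in (16, 64, 256):
    print(attack_length, training_suffix_length(attack_length))
```

Under these assumptions, quadrupling the attack suffix length only doubles the training suffix length, which is where the resource saving would come from.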
Abstract
Jailbreak attacks against large language models (LLMs) aim to induce harmful behaviors in LLMs through carefully crafted adversarial prompts. One way to mitigate such attacks is adversarial training (AT)-based alignment, i.e., training LLMs on some of the most adversarial prompts to help them learn how to behave safely under attack. During AT, the length of adversarial prompts plays a critical role in the robustness of aligned LLMs. While long adversarial prompts during AT might lead to strong LLM robustness, synthesizing them is very resource-consuming, which may limit the application of LLM AT. This paper focuses on adversarial suffix jailbreak attacks and unveils that to defend against a jailbreak attack with an adversarial suffix of length Θ(M), it is enough to align LLMs on prompts with adversarial suffixes of length Θ(√M). Theoretically, we…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Criminal Justice and Corrections Analysis · Cybercrime and Law Enforcement Studies
Methods: Linear Regression · ALIGN
