Short-length Adversarial Training Helps LLMs Defend Long-length Jailbreak Attacks: Theoretical and Empirical Evidence

Shaopeng Fu; Liang Ding; Jingfeng Zhang; Di Wang

arXiv:2502.04204 · cs.LG · February 3, 2026

TL;DR

This paper demonstrates that adversarial training with short-length prompts can effectively defend large language models against long-length jailbreak attacks, supported by theoretical analysis and empirical validation.

Contribution

It reveals that aligning LLMs on short adversarial prompts suffices to defend against longer jailbreak attacks, reducing resource costs.

Findings

01

Adversarial training on short adversarial prompts is effective against long jailbreak attacks.

02

Theoretical analysis shows that the adversarial suffix length needed during training scales with the square root of the attack suffix length.

03

Empirical results confirm the effectiveness of short-length adversarial training.

Abstract

Jailbreak attacks against large language models (LLMs) aim to induce harmful behaviors in LLMs through carefully crafted adversarial prompts. To mitigate attacks, one way is to perform adversarial training (AT)-based alignment, i.e., training LLMs on some of the most adversarial prompts to help them learn how to behave safely under attacks. During AT, the length of adversarial prompts plays a critical role in the robustness of aligned LLMs. While long-length adversarial prompts during AT might lead to strong LLM robustness, their synthesis is very resource-consuming, which may limit the application of LLM AT. This paper focuses on adversarial suffix jailbreak attacks and unveils that to defend against a jailbreak attack with an adversarial suffix of length $\Theta(M)$, it is enough to align LLMs on prompts with adversarial suffixes of length $\Theta(\sqrt{M})$. Theoretically, we…
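
To make the square-root relationship from the findings concrete, here is a minimal Python sketch. It is not taken from the paper or its repository: the helper name `train_suffix_length` and the constant `c` are hypothetical, and because asymptotic notation hides constant factors, the printed numbers only illustrate the scaling rather than recommend settings.

```python
import math


def train_suffix_length(attack_len: int, c: float = 1.0) -> int:
    """Hypothetical helper: training suffix length under M_train = c * sqrt(M_attack).

    `c` stands in for the unknown constant hidden by the asymptotic notation;
    it is an assumption for illustration, not a value reported by the paper.
    """
    if attack_len < 0:
        raise ValueError("attack_len must be non-negative")
    return math.ceil(c * math.sqrt(attack_len))


if __name__ == "__main__":
    # Show how slowly the training length grows as the attack length grows.
    for attack_len in (16, 64, 256, 1024):
        print(attack_len, train_suffix_length(attack_len))
```

Under this assumed relationship, quadrupling the attack suffix length only doubles the suffix length used during adversarial training, which is where the resource saving comes from.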

Code & Models

Repositories

fshp971/adv-icl (PyTorch, official)

Videos

Short-length Adversarial Training Helps LLMs Defend Long-length Jailbreak Attacks: Theoretical and Empirical Evidence · SlidesLive

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Criminal Justice and Corrections Analysis · Cybercrime and Law Enforcement Studies

Methods: Linear Regression · ALIGN