Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks

Mahavir Dabas, Tran Huynh, Nikhil Reddy Billa, Jiachen T. Wang, Peng Gao, Charith Peris, Yao Ma, Rahul Gupta, Ming Jin, Prateek Mittal, and Ruoxi Jia

arXiv:2510.21910 · cs.LG · March 3, 2026

Open Access

TL;DR

This paper introduces a new defense paradigm for large language models against unseen jailbreak attacks by modeling them as compositions of learned adversarial skills, and proposes a training method that enhances robustness through skill composition.

Contribution

The paper proposes the Adversarial Déjà Vu hypothesis and introduces ASCoT, a training approach that improves model robustness by training on compositions of adversarial skills rather than on isolated attacks.

Findings

01

Unseen jailbreaks can be explained as compositions of previous adversarial skills.

02

ASCoT significantly improves robustness to novel attacks.

03

Expanding adversarial skill coverage enhances defense effectiveness.
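
The first finding can be made concrete with a toy sketch. The code below is not the paper's method: it assumes, purely for illustration, that each adversarial skill is a unit vector in some feature space, and it uses greedy matching pursuit (a standard sparse-coding routine, swapped in here as a stand-in) to check how much of an "unseen" attack vector is explained by a few known skills. The skill names, the vectors, and the `matching_pursuit` helper are all hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matching_pursuit(target, atoms, max_skills=2):
    """Greedily explain `target` with at most `max_skills` unit-norm atoms.

    Returns (weights, explained), where `explained` is the fraction of the
    target's squared norm accounted for by the selected atoms.
    """
    residual = list(target)
    weights = {}
    for _ in range(max_skills):
        # Pick the atom most correlated with what is still unexplained.
        name, atom = max(atoms.items(), key=lambda kv: abs(dot(residual, kv[1])))
        coef = dot(residual, atom)
        if abs(coef) < 1e-12:
            break  # no atom explains any of the remaining residual
        weights[name] = weights.get(name, 0.0) + coef
        residual = [r - coef * a for r, a in zip(residual, atom)]
    explained = 1.0 - dot(residual, residual) / dot(target, target)
    return weights, explained


# Hypothetical skill dictionary: names and vectors are made up for this sketch.
SKILLS = {
    "role_play": [1.0, 0.0, 0.0, 0.0],
    "encoding_obfuscation": [0.0, 1.0, 0.0, 0.0],
    "hypothetical_framing": [0.0, 0.0, 1.0, 0.0],
}

# An "unseen" attack that is secretly a mix of two known skills.
recombined_attack = [0.8, 0.6, 0.0, 0.0]
# An attack that lies entirely outside the span of the dictionary.
truly_novel_attack = [0.0, 0.0, 0.0, 1.0]

recombined_weights, recombined_explained = matching_pursuit(recombined_attack, SKILLS)
novel_weights, novel_explained = matching_pursuit(truly_novel_attack, SKILLS)

print("recombined:", recombined_weights, round(recombined_explained, 3))
print("novel:", novel_weights, round(novel_explained, 3))
```

Under these assumptions, a high explained fraction corresponds to the "déjà vu" case (the attack recombines known skills), while a low fraction would indicate a skill missing from the dictionary, which is one way to read the third finding about expanding skill coverage.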

Abstract

Large language models remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs. Defending against novel jailbreaks represents a critical challenge in AI safety. Adversarial training, designed to make models robust against worst-case perturbations, has been the dominant paradigm for adversarial robustness. However, due to optimization challenges and difficulties in defining realistic threat models, adversarial training methods often fail on newly developed jailbreaks in practice. This paper proposes a new paradigm for improving robustness against unseen jailbreaks, centered on the Adversarial Déjà Vu hypothesis: novel jailbreaks are not fundamentally new, but largely recombinations of adversarial skills from previous attacks. We study this hypothesis through a large-scale analysis of 32 attack papers published over two years. Using an…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI