Information Theoretic Adversarial Training of Large Language Models
Yiwei Zhang, Jeremiah Birrell, Reza Ebrahimi, Rouzbeh Behnia, Jason Pacheco, Elisa Bertino

TL;DR
This paper introduces WARDEN, an information-theoretic adversarial training framework for large language models that enhances robustness against adversarial prompts by dynamically reweighting challenging examples.
Contribution
The paper proposes a distributionally robust adversarial training method built on an f-divergence ambiguity set, improving robustness efficiently while maintaining utility.
Findings
WARDEN significantly reduces attack success rates across multiple LLMs.
The method maintains computational efficiency comparable to existing approaches.
WARDEN offers a scalable solution for robust alignment of large language models.
Abstract
Large language models (LLMs) remain vulnerable to adversarial prompting despite advances in alignment and safety, often exhibiting harmful behaviors under novel attack strategies. While adversarial training can improve robustness, existing approaches are computationally expensive and difficult to scale. Recent continuous adversarial training methods, such as Continuous Adversarial Training (CAT) and Continuous Adversarial Preference Optimization (CAPO), address this challenge by leveraging gradient-based perturbations in the embedding space, enabling more efficient and expressive attacks. Building on this paradigm, we propose WARDEN, a distributionally robust adversarial training framework for LLMs that dynamically reweights adversarial examples through an f-divergence ambiguity set around the empirical training distribution. Our method optimizes the worst-case adversarial loss within…
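To illustrate what reweighting through an f-divergence ambiguity set can look like, here is a minimal sketch. It is not WARDEN's actual procedure: the abstract names neither the divergence nor the form of the constraint. The sketch assumes the KL divergence (one member of the f-divergence family) in its penalized form, where the worst-case distribution over a batch has the known closed form of weights proportional to exp(loss / temperature). The function names, the `temperature` parameter, and the example losses are all hypothetical.

```python
import math


def kl_dro_weights(losses, temperature=1.0):
    """Worst-case example weights under a KL penalty (an assumed f-divergence).

    Maximizing sum_i q_i * loss_i - temperature * KL(q || uniform) over
    probability vectors q gives q_i proportional to exp(loss_i / temperature).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [loss / temperature for loss in losses]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def reweighted_loss(losses, temperature=1.0):
    """Batch loss where harder adversarial examples receive larger weights."""
    weights = kl_dro_weights(losses, temperature)
    return sum(w * loss for w, loss in zip(weights, losses))


# Hypothetical per-example adversarial losses for one batch.
adv_losses = [0.2, 0.5, 3.0, 0.1]
weights = kl_dro_weights(adv_losses, temperature=1.0)
robust = reweighted_loss(adv_losses, temperature=1.0)
average = sum(adv_losses) / len(adv_losses)
```

In this sketch a small `temperature` concentrates the weight on the hardest examples, while a large `temperature` recovers the ordinary average loss, so the reweighted objective is never smaller than the plain batch mean.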
