Information Theoretic Adversarial Training of Large Language Models

Yiwei Zhang; Jeremiah Birrell; Reza Ebrahimi; Rouzbeh Behnia; Jason Pacheco; Elisa Bertino

arXiv:2605.05415·cs.LG·May 8, 2026

TL;DR

This paper introduces WARDEN, an information-theoretic adversarial training framework for large language models that enhances robustness against adversarial prompts by dynamically reweighting challenging examples.

Contribution

The paper proposes a distributionally robust adversarial training method built on an f-divergence ambiguity set, aiming to improve robustness efficiently while maintaining utility.

Findings

01. WARDEN significantly reduces attack success rates across multiple LLMs.

02. The method maintains computational efficiency comparable to existing approaches.

03. WARDEN offers a scalable solution for robust alignment of large language models.

Abstract

Large language models (LLMs) remain vulnerable to adversarial prompting despite advances in alignment and safety, often exhibiting harmful behaviors under novel attack strategies. While adversarial training can improve robustness, existing approaches are computationally expensive and difficult to scale. Recent continuous adversarial training methods, such as Continuous Adversarial Training (CAT) and Continuous Adversarial Preference Optimization (CAPO), address this challenge by leveraging gradient-based perturbations in the embedding space, enabling more efficient and expressive attacks. Building on this paradigm, we propose WARDEN, a distributionally robust adversarial training framework for LLMs that dynamically reweights adversarial examples through an f-divergence ambiguity set around the empirical training distribution. Our method optimizes the worst-case adversarial loss within…
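
To make the reweighting idea concrete, here is a minimal sketch of how per-example adversarial losses could be turned into worst-case weights. It assumes the KL divergence as the f-divergence and a fixed dual variable (called `temperature` here); the abstract does not state which f-divergence WARDEN uses or how the dual variable is handled, so the function names, the parameter, and the sample losses are all hypothetical. Under those assumptions, the worst-case distribution over the batch is proportional to exp(loss / temperature), so harder examples receive larger weights.

```python
import math


def kl_dro_weights(losses, temperature=1.0):
    """Worst-case example weights for a KL ambiguity set (assumed choice of f-divergence).

    With the dual variable (`temperature`) held fixed, the worst-case
    distribution puts mass proportional to exp(loss / temperature) on each example.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum before exponentiating for numerical stability.
    max_loss = max(losses)
    unnormalized = [math.exp((loss - max_loss) / temperature) for loss in losses]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def reweighted_loss(losses, temperature=1.0):
    """Weighted batch loss under the worst-case weights above."""
    weights = kl_dro_weights(losses, temperature)
    return sum(w * loss for w, loss in zip(weights, losses))


# Hypothetical per-example adversarial losses for one batch.
adv_losses = [0.2, 0.5, 3.0]
weights = kl_dro_weights(adv_losses, temperature=1.0)
robust = reweighted_loss(adv_losses, temperature=1.0)
mean = sum(adv_losses) / len(adv_losses)

print("weights:", [round(w, 3) for w in weights])
print("reweighted loss:", round(robust, 3), "vs. plain mean:", round(mean, 3))
```

A large `temperature` pushes the weights toward uniform, which recovers ordinary average-loss training; a small one concentrates the weight on the hardest examples. In a real training loop the losses would come from adversarially perturbed inputs, and the weights would typically be treated as constants when backpropagating through the weighted loss.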
