Closing the Distribution Gap in Adversarial Training for LLMs

Chengzhi Hu; Jonas Dornbusch; David Lüdke; Stephan Günnemann; Leo Schwinn

arXiv:2602.15238·cs.LG·February 19, 2026

TL;DR

This paper introduces Distributional Adversarial Training (DAT), a method that uses diffusion LLMs to better approximate the data distribution, improving the adversarial robustness of large language models against simple in-distribution exploits such as rewriting prompts in the past tense or translating them into other languages.

Contribution

The paper proposes DAT, which uses diffusion LLMs to generate diverse, high-likelihood training samples, addressing the gaps in distribution coverage that limit current adversarial training for LLMs.

Findings

01 DAT achieves higher robustness against simple attacks.

02 Diffusion LLMs improve the approximation of the data distribution.

03 DAT yields significant robustness gains over previous methods.

Abstract

Adversarial training for LLMs is one of the most promising methods to reliably improve robustness against adversaries. However, despite significant progress, models remain vulnerable to simple in-distribution exploits, such as rewriting prompts in the past tense or translating them into other languages. We argue that this persistent fragility stems from a fundamental limitation in current adversarial training algorithms: they minimize adversarial loss on their training set but inadequately cover the data distribution, resulting in vulnerability to seemingly simple attacks. To bridge this gap, we propose Distributional Adversarial Training, DAT. We leverage Diffusion LLMs to approximate the true joint distribution of prompts and responses, enabling generation of diverse, high-likelihood samples that address generalization failures. By combining optimization over the data distribution…
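
The gap the abstract describes can be written out as two objectives. The following is only a sketch in generic notation, not the paper's own formulation: the loss \ell, the perturbation set \Delta, the composition operator \oplus, and the generative model q_\phi (standing in for the diffusion LLM) are all symbols assumed here for illustration. The first line is adversarial training on a fixed training set of N prompt-response pairs; the second replaces the training-set average with an expectation over the data distribution p, which in practice would have to be approximated with samples from q_\phi.

```latex
\begin{align}
  % Adversarial training on a fixed training set (assumed notation)
  \min_{\theta} \; & \frac{1}{N} \sum_{i=1}^{N}
      \max_{\delta \in \Delta} \, \ell\bigl(\theta;\, x_i \oplus \delta,\, y_i\bigr) \\
  % Adversarial training over the data distribution p,
  % approximated with samples from a generative model q_\phi \approx p
  \min_{\theta} \; & \mathbb{E}_{(x,y) \sim p}
      \Bigl[ \max_{\delta \in \Delta} \, \ell\bigl(\theta;\, x \oplus \delta,\, y\bigr) \Bigr]
  \;\approx\;
  \min_{\theta} \; \mathbb{E}_{(x,y) \sim q_{\phi}}
      \Bigl[ \max_{\delta \in \Delta} \, \ell\bigl(\theta;\, x \oplus \delta,\, y\bigr) \Bigr]
\end{align}
```

Under this reading, a past-tense or translated prompt is a pair (x, y) that has high probability under p but is absent from the N training pairs, so the first objective never penalizes a failure on it while the second does.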

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Security and Verification in Computing