Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

Shaopeng Fu; Di Wang

arXiv:2604.12817·cs.LG·April 15, 2026

TL;DR

This paper provides a theoretical understanding of continuous adversarial training (CAT) for large language models (LLMs) using in-context learning theory, explaining its effectiveness and proposing improvements based on singular value regularization.

Contribution

It offers the first theoretical analysis of CAT on LLMs, linking robustness to embedding matrix singular values and proposing a regularization method to enhance jailbreak robustness.

Findings

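01. For linear transformers trained on in-context linear regression with embedding-space adversarial examples, a robust generalization bound is proved that correlates negatively with the embedding perturbation radius.

02. The robustness of adversarially trained LLMs is linked to the singular values of their embedding matrices.

03. The proposed singular value regularization improves the jailbreak robustness-utility tradeoff of LLMs.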
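
To make the singular value findings concrete, the following is a minimal sketch of how one might inspect the singular values of an embedding matrix and attach a penalty to them. This summary does not state the form of the paper's regularizer, so the penalty here (a weighted sum of squared singular values) is only a placeholder assumption, and the names `embedding_singular_values`, `singular_value_penalty`, and `weight` are hypothetical.

```python
import numpy as np


def embedding_singular_values(embedding):
    """Singular values of a (vocab_size x dim) matrix, in descending order."""
    return np.linalg.svd(embedding, compute_uv=False)


def singular_value_penalty(embedding, weight=1e-3):
    """Placeholder penalty (an assumption, not the paper's regularizer):
    weight times the sum of squared singular values."""
    s = embedding_singular_values(embedding)
    return float(weight * np.sum(s ** 2))


rng = np.random.default_rng(0)
E = rng.normal(size=(50, 16))  # toy stand-in for an LLM embedding matrix
print(embedding_singular_values(E)[:3], singular_value_penalty(E))
```

In a training loop, a term like this would be added to the task loss. Which function of the singular values to penalize, and in which direction, should be taken from the paper or its repository rather than from this sketch.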

Abstract

Adversarial training (AT) is an effective defense for large language models (LLMs) against jailbreak attacks, but performing AT on LLMs is costly. To improve the efficiency of AT for LLMs, recent studies propose continuous AT (CAT) that searches for adversarial inputs within the continuous embedding space of LLMs during AT. While CAT has achieved empirical success, its underlying mechanism, i.e., why adversarial perturbations in the embedding space can help LLMs defend against jailbreak prompts synthesized in the input token space, remains unknown. This paper presents the first theoretical analysis of CAT on LLMs based on in-context learning (ICL) theory. For linear transformers trained with adversarial examples from the embedding space on in-context linear regression tasks, we prove a robust generalization bound that has a negative correlation with the perturbation radius in the…
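
As a rough illustration of what searching for adversarial inputs in a continuous space within a perturbation radius can look like, here is a minimal sketch on a plain linear predictor with squared loss. It is not the paper's linear-transformer setup: the model, the L2 ball, the step schedule, and all names (`pgd_embedding_attack`, `worst_case_loss`, `eps`, `step_frac`) are assumptions chosen for illustration. For this simple model the worst-case loss has a known closed form, which the sketch uses as a sanity check.

```python
import numpy as np


def squared_loss(w, x, y):
    """Squared error of the linear predictor w @ x against target y."""
    return float((w @ x - y) ** 2)


def pgd_embedding_attack(w, x, y, eps, steps=20, step_frac=0.5):
    """Projected gradient ascent on delta subject to ||delta||_2 <= eps."""
    delta = np.zeros_like(x, dtype=float)
    for _ in range(steps):
        residual = w @ (x + delta) - y
        grad = 2.0 * residual * w  # gradient of the squared loss w.r.t. delta
        grad_norm = np.linalg.norm(grad)
        if grad_norm == 0.0:
            break  # zero gradient: no ascent direction available
        delta = delta + step_frac * eps * grad / grad_norm
        delta_norm = np.linalg.norm(delta)
        if delta_norm > eps:
            delta = delta * (eps / delta_norm)  # project back onto the ball
    return delta


def worst_case_loss(w, x, y, eps):
    """Closed-form maximum of the squared loss over the L2 ball (linear model only)."""
    return float((abs(w @ x - y) + eps * np.linalg.norm(w)) ** 2)


rng = np.random.default_rng(0)
w = rng.normal(size=8)  # toy predictor weights
x = rng.normal(size=8)  # toy continuous input (stand-in for an embedding)
y = 0.0
eps = 0.1               # perturbation radius
delta = pgd_embedding_attack(w, x, y, eps)
print(squared_loss(w, x + delta, y), worst_case_loss(w, x, y, eps))
```

Adversarial training would then minimize the loss at `x + delta` instead of at `x`. In an LLM the inner search has no closed form, which is why an iterative search such as the loop above is used.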

Code & Models

Repositories

fshp971/continuous-adv-icl (GitHub)
