TL;DR
This paper provides a theoretical understanding of continuous adversarial training (CAT) for large language models (LLMs) using in-context learning theory, explaining its effectiveness and proposing improvements based on singular value regularization.
Contribution
It offers the first theoretical analysis of CAT on LLMs, linking robustness to embedding matrix singular values and proposing a regularization method to enhance jailbreak robustness.
Findings
Proves a robust generalization bound for CAT that is negatively correlated with the embedding perturbation radius.
Links the robustness of adversarially trained LLMs to the singular values of their embedding matrices.
Proposes a singular-value regularization that improves the jailbreak robustness-utility tradeoff of LLMs.
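To make the singular-value finding concrete, here is a minimal sketch of how a penalty on an embedding matrix's singular values could be computed. The summary does not state the exact form of the paper's regularizer, so the sum-of-squares penalty, the function name `singular_value_penalty`, and the `weight` parameter are all illustrative assumptions, not the authors' method.

```python
import numpy as np


def singular_value_penalty(embedding: np.ndarray, weight: float = 1e-2) -> float:
    """Hypothetical regularizer: weight * sum of squared singular values.

    The sum-of-squares form is an assumption made for illustration; the
    paper's actual regularizer is not specified in this summary.
    """
    # compute_uv=False returns only the singular values, in descending order.
    singular_values = np.linalg.svd(embedding, compute_uv=False)
    return float(weight * np.sum(singular_values ** 2))


# Toy embedding matrix (8 tokens, 4 dimensions); the sizes are arbitrary.
rng = np.random.default_rng(0)
E = rng.normal(size=(8, 4))

singular_values = np.linalg.svd(E, compute_uv=False)
penalty = singular_value_penalty(E)
```

In training, a term like `penalty` would be added to the task loss. With this particular choice the penalty equals the weighted squared Frobenius norm of `E`, which is why it is cheap to sanity-check.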
Abstract
Adversarial training (AT) is an effective defense for large language models (LLMs) against jailbreak attacks, but performing AT on LLMs is costly. To improve the efficiency of AT for LLMs, recent studies propose continuous AT (CAT) that searches for adversarial inputs within the continuous embedding space of LLMs during AT. While CAT has achieved empirical success, its underlying mechanism, i.e., why adversarial perturbations in the embedding space can help LLMs defend against jailbreak prompts synthesized in the input token space, remains unknown. This paper presents the first theoretical analysis of CAT on LLMs based on in-context learning (ICL) theory. For linear transformers trained with adversarial examples from the embedding space on in-context linear regression tasks, we prove a robust generalization bound that has a negative correlation with the perturbation radius in the…
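The idea of searching for adversarial inputs in a continuous embedding space can be sketched on a toy case. The example below uses a plain linear predictor with squared loss rather than the linear transformer and in-context regression setup analyzed in the paper; that simplification, and the names `worst_case_perturbation` and `adversarial_loss`, are assumptions for illustration only. For this toy model the worst-case perturbation inside an L2 ball has a closed form, so no iterative attack is needed.

```python
import numpy as np


def worst_case_perturbation(w, x, y, radius):
    """L2-bounded perturbation of x that maximizes the squared error of w @ x.

    Moving x along +/- w / ||w|| changes the prediction fastest; the sign is
    chosen to push the residual further from zero.
    """
    residual = w @ x - y
    direction = 1.0 if residual >= 0 else -1.0
    return radius * direction * w / np.linalg.norm(w)


def adversarial_loss(w, x, y, radius):
    """Squared error evaluated at the worst-case perturbed input."""
    delta = worst_case_perturbation(w, x, y, radius)
    return float((w @ (x + delta) - y) ** 2)


# Toy data: a 5-dimensional "embedding" x, weights w, and target y.
rng = np.random.default_rng(0)
w = rng.normal(size=5)
x = rng.normal(size=5)
y = 0.3
radius = 0.1

clean = float((w @ x - y) ** 2)
adv = adversarial_loss(w, x, y, radius)
```

Here `adv` equals `(|w @ x - y| + radius * ||w||) ** 2`, so the adversarial loss grows with both the perturbation radius and the norm of the weights. Adversarial training would minimize `adv` instead of `clean`.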
