Efficient Adversarial Training in LLMs with Continuous Attacks
Sophie Xhonneux, Alessandro Sordoni, Stephan Günnemann, Gauthier Gidel, Leo Schwinn

TL;DR
This paper introduces a computationally efficient adversarial training method for large language models by performing attacks in continuous embedding space, significantly improving robustness against discrete attacks while maintaining utility.
Contribution
It proposes a novel fast adversarial training algorithm (C-AdvUL) and an adversarial IPO variant (C-AdvIPO) that operate in embedding space, reducing computational costs for LLM robustness training.
Findings
Enhanced robustness of LLMs against discrete attacks
Maintained utility of models after adversarial training
Scalable approach applicable to various LLM architectures
Abstract
Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails. In many domains, adversarial training has proven to be one of the most promising methods to reliably improve robustness against such attacks. Yet, in the context of LLMs, current methods for adversarial training are hindered by the high computational costs required to perform discrete adversarial attacks at each training iteration. We address this problem by instead calculating adversarial attacks in the continuous embedding space of the LLM, which is orders of magnitude more efficient. We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses: the first makes the model robust on continuous embedding attacks computed on an adversarial behaviour dataset; the second ensures the usefulness of the final model by fine-tuning on utility data. Moreover, we…
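To make the idea of an attack in continuous embedding space concrete, here is a minimal sketch on a toy problem. It is not the paper's implementation: the linear "model", the L-infinity bound `eps`, the signed-gradient update, and every function name are assumptions chosen for illustration. The perturbation `delta` is added to an input embedding and updated by gradient steps to increase the loss of the safe response, and the final function shows one possible way to combine a robustness term with a utility term as the abstract describes.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def safe_loss(embedding, weights, bias):
    """Negative log-probability of the 'safe' label under a toy linear model."""
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    return math.log1p(math.exp(-z))


def loss_gradient(embedding, weights, bias):
    """Gradient of safe_loss with respect to the embedding."""
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    coeff = -(1.0 - sigmoid(z))
    return [coeff * w for w in weights]


def sign(value):
    return (value > 0) - (value < 0)


def embedding_attack(embedding, weights, bias, eps=0.1, step_size=0.02, n_steps=10):
    """Signed-gradient ascent on a perturbation, clipped to [-eps, eps] (assumed threat model)."""
    delta = [0.0] * len(embedding)
    for _ in range(n_steps):
        perturbed = [x + d for x, d in zip(embedding, delta)]
        grad = loss_gradient(perturbed, weights, bias)
        # Step in the direction that increases the loss, then project back into the eps-ball.
        delta = [
            max(-eps, min(eps, d + step_size * sign(g)))
            for d, g in zip(delta, grad)
        ]
    return delta


def training_objective(robust_loss, utility_loss, utility_weight=1.0):
    """Hypothetical weighted sum of the robustness and utility terms."""
    return robust_loss + utility_weight * utility_loss


# Usage: attack one embedding, then evaluate the robustness term on the attacked input.
weights, bias = [0.5, -1.0, 2.0], 0.0
embedding = [0.2, 0.1, 0.3]
delta = embedding_attack(embedding, weights, bias)
attacked = [x + d for x, d in zip(embedding, delta)]
robust_term = safe_loss(attacked, weights, bias)
```

Because the attack only needs gradients with respect to the embedding, each step costs one forward and one backward pass, with no search over discrete tokens; that is the source of the efficiency gain the abstract refers to. In an actual LLM the gradient would come from automatic differentiation rather than the closed form used in this toy.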
Taxonomy
Topics: Adversarial Robustness in Machine Learning
