Efficient Adversarial Training in LLMs with Continuous Attacks

Sophie Xhonneux; Alessandro Sordoni; Stephan Günnemann; Gauthier Gidel; Leo Schwinn

arXiv:2405.15589·cs.LG·November 4, 2024·3 cites

Open Access · 1 Repo · 6 Models

TL;DR

This paper introduces a computationally efficient adversarial training method for large language models by performing attacks in continuous embedding space, significantly improving robustness against discrete attacks while maintaining utility.

Contribution

It proposes a novel fast adversarial training algorithm (C-AdvUL) and an adversarial IPO variant (C-AdvIPO) that operate in embedding space, reducing computational costs for LLM robustness training.

Findings

01 Enhanced robustness of LLMs against discrete attacks

02 Maintained utility of models after adversarial training

03 Scalable approach applicable to various LLM architectures

Abstract

Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails. In many domains, adversarial training has proven to be one of the most promising methods to reliably improve robustness against such attacks. Yet, in the context of LLMs, current methods for adversarial training are hindered by the high computational costs required to perform discrete adversarial attacks at each training iteration. We address this problem by instead calculating adversarial attacks in the continuous embedding space of the LLM, which is orders of magnitude more efficient. We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses: the first makes the model robust on continuous embedding attacks computed on an adversarial behaviour dataset; the second ensures the usefulness of the final model by fine-tuning on utility data. Moreover, we…
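To make the idea in the abstract concrete, here is a minimal toy sketch of an attack computed in continuous embedding space, combined with a utility loss. It is not the paper's C-AdvUL implementation: the linear "model", the signed-gradient update, the L-infinity budget `eps`, the step size, and every function name below are assumptions chosen for illustration, and an actual LLM would use its own embeddings and an autograd framework.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_grad(w, b, e, y):
    """Cross-entropy of a toy linear classifier on embedding e, plus d(loss)/d(e)."""
    z = sum(wi * ei for wi, ei in zip(w, e)) + b
    p = sigmoid(z)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    grad_e = [(p - y) * wi for wi in w]  # gradient w.r.t. the input embedding
    return loss, grad_e


def sign(x):
    return (x > 0) - (x < 0)


def continuous_attack(w, b, e, y, eps=0.5, step=0.1, n_steps=10):
    """Signed-gradient ascent on the loss, kept inside an L-inf ball of radius eps.

    The perturbation lives directly in embedding space, so no discrete
    token search is needed (hypothetical hyperparameters).
    """
    delta = [0.0] * len(e)
    for _ in range(n_steps):
        e_adv = [ei + di for ei, di in zip(e, delta)]
        _, grad = loss_and_grad(w, b, e_adv, y)
        delta = [
            max(-eps, min(eps, di + step * sign(gi)))  # step, then project
            for di, gi in zip(delta, grad)
        ]
    return delta


def combined_loss(w, b, adv_data, utility_data, eps=0.5):
    """Mean loss on attacked embeddings plus mean loss on clean utility data."""
    adv_losses = []
    for e, y in adv_data:
        delta = continuous_attack(w, b, e, y, eps=eps)
        e_adv = [ei + di for ei, di in zip(e, delta)]
        adv_losses.append(loss_and_grad(w, b, e_adv, y)[0])
    util_losses = [loss_and_grad(w, b, e, y)[0] for e, y in utility_data]
    return sum(adv_losses) / len(adv_losses) + sum(util_losses) / len(util_losses)


# Toy data: y = 1 stands in for the "safe" response on an adversarial prompt.
w, b = [1.0, -2.0, 0.5], 0.0
e, y = [0.2, -0.1, 0.4], 1
delta = continuous_attack(w, b, e, y)
clean_loss, _ = loss_and_grad(w, b, e, y)
adv_loss, _ = loss_and_grad(w, b, [ei + di for ei, di in zip(e, delta)], y)
total = combined_loss(w, b, [(e, y)], [([0.3, 0.1, -0.2], 0)])
print(clean_loss, adv_loss, total)
```

In a training loop, `combined_loss` would be minimised with respect to the model parameters (`w` and `b` here), while the inner attack is recomputed at each step. The efficiency argument is that each attack step costs one gradient evaluation with respect to the embeddings, with no search over discrete tokens.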

Code & Models

Repositories

sophie-xhonneux/continuous-advtrain
PyTorch · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning