TL;DR
MixAT introduces a novel adversarial training method for LLMs that combines discrete and continuous attacks, significantly improving robustness against harmful outputs while maintaining computational efficiency.
Contribution
The paper presents MixAT, a new approach that effectively combines discrete and continuous adversarial attacks during training to enhance LLM robustness.
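To make the idea of mixing the two attack types concrete, here is a toy sketch and not the paper's implementation. It assumes the mix works as follows: with some probability the prompt is first rewritten by a discrete attack, and a norm-bounded continuous perturbation is then added in embedding space. The names `toy_embed`, `toy_rewrite`, `p_discrete`, and `eps` are hypothetical, and a random direction stands in for a gradient-based perturbation.

```python
import math
import random


def l2_project(delta, eps):
    """Scale delta back onto the L2 ball of radius eps if it lies outside."""
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= eps or norm == 0.0:
        return list(delta)
    return [d * eps / norm for d in delta]


def mixed_adversarial_input(prompt, embed, discrete_attack, eps, p_discrete, rng):
    """Return (prompt_used, perturbed_embedding) for one training example.

    With probability p_discrete the prompt is rewritten by a discrete attack;
    a bounded continuous perturbation is then added to its embedding.
    """
    if rng.random() < p_discrete:
        prompt = discrete_attack(prompt)
    emb = embed(prompt)
    # Assumption: a random direction stands in for a gradient-based attack.
    delta = l2_project([rng.gauss(0.0, 1.0) for _ in emb], eps)
    return prompt, [e + d for e, d in zip(emb, delta)]


def toy_embed(prompt):
    """Hypothetical 2-d 'embedding' used only for illustration."""
    return [float(len(prompt)), float(sum(map(ord, prompt)) % 97)]


def toy_rewrite(prompt):
    """Hypothetical stand-in for a discrete attack such as a paraphrase."""
    return prompt + " (rephrased)"


rng = random.Random(0)
used, perturbed = mixed_adversarial_input(
    "hello", toy_embed, toy_rewrite, eps=0.5, p_discrete=0.5, rng=rng
)
print(used, perturbed)
```

In a real training loop the perturbed embedding would be fed to the model and the loss would push it toward a safe response; that part is omitted here.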
Findings
MixAT reduces worst-case attack success rate to below 20%.
It maintains runtime efficiency comparable to continuous relaxation methods.
MixAT reveals additional vulnerabilities in deployment settings.
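The "worst-case attack success rate" in the findings above can be read as follows (an assumption about the metric, not a definition taken from this page): a prompt counts as broken if at least one attack in the evaluated set succeeds on it. A minimal sketch with made-up attack names and outcomes:

```python
def worst_case_asr(results):
    """Fraction of prompts broken by at least one attack.

    results maps an attack name to a list of booleans, one per prompt,
    where True means the attack succeeded on that prompt.
    """
    per_attack = list(results.values())
    if not per_attack or not per_attack[0]:
        raise ValueError("need at least one attack and one prompt")
    n = len(per_attack[0])
    if any(len(r) != n for r in per_attack):
        raise ValueError("all attacks must be evaluated on the same prompts")
    broken = sum(any(outcomes) for outcomes in zip(*per_attack))
    return broken / n


# Hypothetical outcomes for three attacks on five prompts.
results = {
    "attack_a": [True, False, False, False, False],
    "attack_b": [False, False, True, False, False],
    "attack_c": [True, False, False, False, False],
}
print(worst_case_asr(results))  # 0.4
```

Each attack alone succeeds on 1 of 5 prompts (20%), yet the worst-case rate is 40%, which is why this metric is stricter than any single-attack success rate.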
Abstract
Despite recent efforts in Large Language Model (LLM) safety and alignment, current adversarial attacks on frontier LLMs can still consistently force harmful generations. Although adversarial training has been widely studied and shown to significantly improve the robustness of traditional machine learning models, its strengths and weaknesses in the context of LLMs are less understood. Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often computationally expensive, leading to reliance on continuous relaxations. At the same time, despite their effectiveness and generalization capabilities, training with continuous perturbations does not always capture the full spectrum of vulnerabilities exploited by discrete attacks. In this work, we aim to bridge this gap by introducing MixAT, a novel…
Code & Models
- 🤗 INSAIT-Institute/Mistral-7B-MixAT · model · 228 dl
- 🤗 INSAIT-Institute/Zephyr-7B-MixAT · model · 216 dl
- 🤗 INSAIT-Institute/Zephyr-7B-MixAT-GCG · model · 219 dl
- 🤗 INSAIT-Institute/Llama3-8B-MixAT-GCG · model · 2.0k dl
- 🤗 INSAIT-Institute/Llama3-8B-MixAT · model · 2.0k dl
- 🤗 INSAIT-Institute/Llama3.1-8B-MixAT · model · 3 dl
- 🤗 INSAIT-Institute/Qwen-14B-MixAT · model · 15 dl
- 🤗 INSAIT-Institute/Qwen-14B-MixAT-GCG · model · 1.9k dl
- 🤗 INSAIT-Institute/Qwen-32B-MixAT · model
Taxonomy
Methods: Sparse Evolutionary Training
