Adversarial Attacks on Large Language Models Using Regularized Relaxation

Samuel Jacob Chacko; Sajib Biswas; Chashi Mahiul Islam; Fatema Tabassum Liza; Xiuwen Liu

arXiv:2410.19160 · cs.LG · October 28, 2024

TL;DR

This paper introduces an adversarial attack on large language models that combines regularized gradients with continuous optimization, so that the optimized input maps to valid vocabulary tokens while reaching higher attack success rates than existing techniques.

Contribution

The paper presents a regularized gradient-based attack that produces valid vocabulary tokens, avoiding both the inefficiency of discrete token optimization and the invalid outputs of earlier continuous optimization methods.

Findings

01 Two orders of magnitude faster than previous methods

02 Significantly higher attack success rate on multiple LLMs

03 Effective across five state-of-the-art models and four datasets

Abstract

As powerful Large Language Models (LLMs) are now widely used for numerous practical applications, their safety is of critical importance. While alignment techniques have significantly improved overall safety, LLMs remain vulnerable to carefully crafted adversarial inputs. Consequently, adversarial attack methods are extensively used to study and understand these vulnerabilities. However, current attack methods face significant limitations. Those relying on optimizing discrete tokens suffer from limited efficiency, while continuous optimization techniques fail to generate valid tokens from the model's vocabulary, rendering them impractical for real-world applications. In this paper, we propose a novel technique for adversarial attacks that overcomes these limitations by leveraging regularized gradients with continuous optimization methods. Our approach is two orders of magnitude faster…
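To make the idea in the abstract concrete, here is a minimal, dependency-free sketch of continuous optimization with a regularizer followed by discretization. It is not the authors' implementation: the toy objective (matching a target vector), the choice of regularizer (a penalty on the distance to the nearest vocabulary embedding), and the names `relaxed_attack`, `nearest_token`, `lam`, and `gap` are all assumptions made for illustration. A real attack would replace the toy objective with a loss computed through an LLM.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_token(vec, vocab):
    """Index of the vocabulary embedding closest to vec."""
    return min(range(len(vocab)), key=lambda i: sq_dist(vec, vocab[i]))


def relaxed_attack(vocab, target, n_tokens=3, steps=200, lr=0.1, lam=0.5, seed=0):
    """Optimize continuous suffix embeddings, then snap them to tokens.

    Toy objective (an assumption, standing in for a model loss):
        ||mean(suffix) - target||^2
    Regularizer (also an assumption):
        lam * ||v - nearest vocabulary embedding||^2 for each suffix vector v,
    with the nearest embedding treated as a constant within each step.
    """
    rng = random.Random(seed)
    dim = len(target)
    # Start from random points in the continuous embedding space.
    suffix = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]
    for _ in range(steps):
        # Mean of the suffix, computed once per step before any update.
        mean = [sum(v[d] for v in suffix) / n_tokens for d in range(dim)]
        for v in suffix:
            anchor = vocab[nearest_token(v, vocab)]
            for d in range(dim):
                grad_task = 2.0 * (mean[d] - target[d]) / n_tokens
                grad_reg = 2.0 * lam * (v[d] - anchor[d])
                v[d] -= lr * (grad_task + grad_reg)
    # Discretize: each relaxed vector becomes its nearest valid token.
    tokens = [nearest_token(v, vocab) for v in suffix]
    # Largest remaining distance between a relaxed vector and its token.
    gap = max(sq_dist(v, vocab[t]) for v, t in zip(suffix, tokens))
    return tokens, gap


if __name__ == "__main__":
    vocab = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    tokens, gap = relaxed_attack(vocab, target=[0.6, 0.4])
    print(tokens, gap)
```

Setting `lam=0` in this sketch removes the regularizer, which leaves plain continuous optimization with nothing pulling the vectors toward the vocabulary; comparing `gap` between the two settings is one way to see what the penalty contributes in this toy setup.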

Code & Models

Repositories

sj21j/Regularized_Relaxation
PyTorch · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling