Adversarial Attacks on Large Language Models Using Regularized Relaxation
Samuel Jacob Chacko, Sajib Biswas, Chashi Mahiul Islam, Fatema Tabassum Liza, Xiuwen Liu

TL;DR
This paper introduces an adversarial attack on large language models that applies regularized gradients within continuous optimization, so that the optimized input maps to valid vocabulary tokens while running far faster and reaching higher attack success rates than existing techniques.
Contribution
The paper presents a regularized, gradient-based attack that addresses two limitations of earlier work: the low efficiency of discrete token optimization and the failure of continuous optimization to produce valid tokens from the model's vocabulary.
Findings
Two orders of magnitude faster than previous methods
Significantly higher attack success rate on multiple LLMs
Effective across five state-of-the-art models and four datasets
Abstract
As powerful Large Language Models (LLMs) are now widely used for numerous practical applications, their safety is of critical importance. While alignment techniques have significantly improved overall safety, LLMs remain vulnerable to carefully crafted adversarial inputs. Consequently, adversarial attack methods are extensively used to study and understand these vulnerabilities. However, current attack methods face significant limitations. Those relying on optimizing discrete tokens suffer from limited efficiency, while continuous optimization techniques fail to generate valid tokens from the model's vocabulary, rendering them impractical for real-world applications. In this paper, we propose a novel technique for adversarial attacks that overcomes these limitations by leveraging regularized gradients with continuous optimization methods. Our approach is two orders of magnitude faster…
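The idea of pairing continuous optimization with a regularizer, so that the result stays close to valid vocabulary tokens, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names (`relaxed_attack`, `nearest_token`), the four-entry 2-D vocabulary, the quadratic stand-in for an attack loss, and the choice of regularizer (a pull toward the nearest vocabulary embedding) do not come from the paper, whose actual loss, regularizer, and optimizer are not given in this excerpt.

```python
import random


def nearest_token(vec, vocab):
    """Index of the vocabulary embedding closest to vec (squared L2 distance)."""
    return min(
        range(len(vocab)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, vocab[i])),
    )


def relaxed_attack(vocab, target, n_tokens=3, steps=200, lr=0.1, lam=0.2, seed=0):
    """Toy regularized continuous optimization followed by discretization.

    Hypothetical setup: the "attack loss" is ||x - target||^2 per position,
    standing in for a real model loss, and the regularizer is
    lam * ||x - nearest vocabulary embedding||^2.
    """
    rng = random.Random(seed)
    dim = len(target)
    # Relaxed suffix: one free continuous vector per token position.
    suffix = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]
    for _ in range(steps):
        for vec in suffix:
            # The closest valid token embedding acts as the regularization anchor.
            anchor = vocab[nearest_token(vec, vocab)]
            for k in range(dim):
                grad_loss = 2.0 * (vec[k] - target[k])        # toy loss gradient
                grad_reg = 2.0 * lam * (vec[k] - anchor[k])   # pull toward a valid token
                vec[k] -= lr * (grad_loss + grad_reg)
    # Discretize: map each relaxed vector to its nearest vocabulary token.
    return [nearest_token(vec, vocab) for vec in suffix]


vocab = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = [0.9, 0.2]
tokens = relaxed_attack(vocab, target)
print(tokens)
```

In this toy setting every position settles near the vocabulary entry `[1.0, 0.0]`, so the final discretization step changes the optimized vectors only slightly. A real attack would replace the quadratic loss with gradients from the target model and operate on its actual embedding matrix.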
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling
