Adversarial Attack on Large Language Models using Exponentiated Gradient Descent
Sajib Biswas, Mao Nishino, Samuel Jacob Chacko, Xiuwen Liu

TL;DR
This paper introduces an adversarial attack on large language models based on exponentiated gradient descent. The authors report that it jailbreaks models with a higher success rate and greater efficiency than existing techniques.
Contribution
The paper develops an intrinsic optimization technique that combines exponentiated gradient descent with Bregman projection to attack LLMs more effectively, and it provides a convergence proof along with a practical implementation.
Findings
Achieves higher success rate than existing methods
Demonstrates effectiveness on five open-source LLMs
Provides an efficient algorithm with theoretical convergence proof
Abstract
As Large Language Models (LLMs) are widely used, understanding them systematically is key to improving their safety and realizing their full potential. Although many models are aligned using techniques such as reinforcement learning from human feedback (RLHF), they are still vulnerable to jailbreaking attacks. Some of the existing adversarial attack methods search for discrete tokens that may jailbreak a target model while others try to optimize the continuous space represented by the tokens of the model's vocabulary. While techniques based on the discrete space may prove to be inefficient, optimization of continuous token embeddings requires projections to produce discrete tokens, which might render them ineffective. To fully utilize the constraints and the structures of the space, we develop an intrinsic optimization technique using exponentiated gradient descent with the Bregman…
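The exponentiated gradient idea named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each prompt position is relaxed to a probability distribution over a tiny hypothetical vocabulary, and it replaces the LLM loss with a toy linear cost so the gradient is known in closed form. The names `egd_step`, `run_toy_egd`, and `cost` are made up for illustration. The multiplicative update followed by normalization is the standard exponentiated gradient step, where normalizing corresponds to the KL (Bregman) projection back onto the probability simplex.

```python
import math
import random


def egd_step(probs, grad, lr):
    """One exponentiated gradient step for a single distribution over tokens.

    Computes probs * exp(-lr * grad) in log space for numerical stability,
    then renormalizes so the result stays on the probability simplex.
    """
    log_w = [math.log(max(p, 1e-300)) - lr * g for p, g in zip(probs, grad)]
    shift = max(log_w)  # subtract the max before exponentiating
    w = [math.exp(v - shift) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


def run_toy_egd(cost, steps=100, lr=0.5):
    """Minimize the toy loss sum(x * cost), one distribution per position.

    Assumption: a linear loss stands in for the model's loss, so its
    gradient with respect to each distribution is just that row of cost.
    """
    vocab = len(cost[0])
    x = [[1.0 / vocab] * vocab for _ in cost]  # start from uniform
    for _ in range(steps):
        x = [egd_step(row, c, lr) for row, c in zip(x, cost)]
    return x


random.seed(0)
# Hypothetical costs: 3 prompt positions, vocabulary of 5 tokens.
cost = [[random.gauss(0.0, 1.0) for _ in range(5)] for _ in range(3)]
x = run_toy_egd(cost)
# Read off one discrete token per position from the relaxed distributions.
tokens = [row.index(max(row)) for row in x]
```

Because every iterate is already a valid distribution, no separate Euclidean projection step is needed in this sketch. With the toy linear cost, each row concentrates on its lowest-cost token.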
Taxonomy
Topics: Adversarial Robustness in Machine Learning
