Adversarial Attack on Large Language Models using Exponentiated Gradient Descent

Sajib Biswas; Mao Nishino; Samuel Jacob Chacko; Xiuwen Liu

arXiv:2505.09820 · cs.LG · May 16, 2025

TL;DR

This paper introduces an adversarial attack on large language models based on exponentiated gradient descent. The attack jailbreaks the evaluated models with higher success rates and greater efficiency than existing techniques.

Contribution

The paper develops an intrinsic optimization technique with exponentiated gradient descent and Bregman projection for more effective adversarial attacks on LLMs, with proven convergence and practical implementation.
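The core update can be illustrated with a minimal sketch, assuming a toy setting that is not taken from the paper: each adversarial token position is relaxed to a probability distribution over a small vocabulary, and a hypothetical linear cost stands in for the model's loss. The names `egd_step`, `run_toy_egd`, and `cost` are illustrative only. The exponentiated gradient step is a multiplicative update, and under the KL divergence the Bregman projection back onto the probability simplex reduces to renormalizing each row.

```python
import numpy as np

def egd_step(x, grad, lr):
    """One exponentiated gradient descent step.

    Each row of x is a distribution over the vocabulary. The update is
    multiplicative, so entries stay non-negative; renormalizing each row
    is the KL (Bregman) projection back onto the probability simplex.
    """
    z = -lr * grad
    # Shift by the row-wise max for numerical stability; the shift
    # cancels out in the normalization below.
    z = z - z.max(axis=-1, keepdims=True)
    y = x * np.exp(z)
    return y / y.sum(axis=-1, keepdims=True)

def run_toy_egd(cost, steps=200, lr=0.5):
    """Minimize the toy objective sum(cost * x) over row-wise simplices.

    `cost` is a hypothetical stand-in for the gradient that a real attack
    would obtain by backpropagating a loss through the target model.
    """
    n_positions, vocab_size = cost.shape
    # Start from the uniform distribution at every token position.
    x = np.full((n_positions, vocab_size), 1.0 / vocab_size)
    for _ in range(steps):
        grad = cost  # gradient of the linear toy objective w.r.t. x
        x = egd_step(x, grad, lr)
    return x

rng = np.random.default_rng(0)
cost = rng.normal(size=(3, 5))   # 3 token positions, vocabulary of 5
x = run_toy_egd(cost)
tokens = x.argmax(axis=-1)       # discretize: most probable token per position
```

In this toy problem each row concentrates on the lowest-cost vocabulary entry, so taking the argmax recovers a discrete token per position. A real attack would replace the fixed `cost` with gradients from the target model; the paper's loss and any additional terms it uses are not reproduced here.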

Findings

01 Achieves a higher success rate than existing methods

02 Demonstrates effectiveness on five open-source LLMs

03 Provides an efficient algorithm with a theoretical convergence proof

Abstract

As Large Language Models (LLMs) are widely used, understanding them systematically is key to improving their safety and realizing their full potential. Although many models are aligned using techniques such as reinforcement learning from human feedback (RLHF), they are still vulnerable to jailbreaking attacks. Some of the existing adversarial attack methods search for discrete tokens that may jailbreak a target model while others try to optimize the continuous space represented by the tokens of the model's vocabulary. While techniques based on the discrete space may prove to be inefficient, optimization of continuous token embeddings requires projections to produce discrete tokens, which might render them ineffective. To fully utilize the constraints and the structures of the space, we develop an intrinsic optimization technique using exponentiated gradient descent with the Bregman…

Code & Models

Repositories

sbamit/exponentiated-gradient-descent-llm-attack (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning