Universal and Transferable Adversarial Attack on Large Language Models Using Exponentiated Gradient Descent

Sajib Biswas; Mao Nishino; Samuel Jacob Chacko; Xiuwen Liu

arXiv:2508.14853·cs.LG·August 21, 2025

Open Access

TL;DR

This paper introduces an optimization-based method that uses exponentiated gradient descent to craft universal adversarial suffixes capable of jailbreaking large language models. It reports higher attack success rates than state-of-the-art baselines and suffixes that transfer across different LLMs.

Contribution

The paper presents an intrinsic optimization technique for adversarial attacks on LLMs, with theoretical convergence guarantees, that the authors report to be more effective and more transferable than existing methods.

Findings

01 Achieves higher success rates than state-of-the-art baselines.

02 Effectively generates universal adversarial suffixes.

03 Demonstrates transferability across different LLMs.

Abstract

As large language models (LLMs) are increasingly deployed in critical applications, ensuring their robustness and safety alignment remains a major challenge. Despite the overall success of alignment techniques such as reinforcement learning from human feedback (RLHF) on typical prompts, LLMs remain vulnerable to jailbreak attacks enabled by crafted adversarial triggers appended to user prompts. Most existing jailbreak methods either rely on inefficient searches over discrete token spaces or direct optimization of continuous embeddings. While continuous embeddings can be given directly to selected open-source models as input, doing so is not feasible for proprietary models. On the other hand, projecting these embeddings back into valid discrete tokens introduces additional complexity and often reduces attack effectiveness. We propose an intrinsic optimization method which directly…
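The exponentiated gradient descent named in the title can be illustrated with a minimal sketch on a toy problem. This is not the authors' implementation: the abstract excerpt above is truncated, so the choice of optimization variable (a matrix whose rows are probability distributions over a small vocabulary), the quadratic toy loss, and the names `egd_step`, `toy_loss_and_grad`, and `run_egd` are all assumptions made for illustration. The sketch only shows the general multiplicative update followed by renormalization, which keeps every row non-negative and summing to one without a separate projection onto discrete tokens.

```python
import numpy as np


def egd_step(x, grad, lr):
    """One exponentiated gradient descent step.

    Each row of x is assumed to lie on the probability simplex.
    The update is x * exp(-lr * grad), renormalized per row.
    """
    # Work in log space; the clip guards against log(0).
    logits = np.log(np.clip(x, 1e-12, None)) - lr * grad
    # Subtracting the row maximum does not change the normalized result
    # but avoids overflow in exp.
    logits -= logits.max(axis=-1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)


def toy_loss_and_grad(x, target):
    """Hypothetical stand-in objective: 0.5 * ||x - target||^2.

    A real attack would use a model-dependent loss; this toy loss
    exists only to make the sketch runnable.
    """
    diff = x - target
    return 0.5 * np.sum(diff ** 2), diff


def run_egd(num_rows=4, vocab=6, steps=500, lr=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # Random one-hot targets, one per row.
    target = np.eye(vocab)[rng.integers(0, vocab, size=num_rows)]
    # Start from the uniform distribution in every row.
    x = np.full((num_rows, vocab), 1.0 / vocab)
    losses = []
    for _ in range(steps):
        loss, grad = toy_loss_and_grad(x, target)
        losses.append(loss)
        x = egd_step(x, grad, lr)
    return x, target, losses


if __name__ == "__main__":
    x, target, losses = run_egd()
    print("rows sum to one:", np.allclose(x.sum(axis=1), 1.0))
    print("argmax matches target:", (x.argmax(axis=1) == target.argmax(axis=1)).all())
```

Because the update is multiplicative, entries stay strictly positive and the iterate never leaves the simplex. That is the property that makes this family of methods a natural fit for relaxed distributions over tokens, under the assumptions stated above.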

Taxonomy

Topics: Adversarial Robustness in Machine Learning