GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs
Advik Raj Basani, Xiao Zhang

TL;DR
GASP is a novel black-box framework that efficiently generates human-readable adversarial prompts to jailbreak large language models, improving success rates while reducing computational costs.
Contribution
GASP introduces a method based on latent Bayesian optimization for generating natural adversarial suffixes in a fully black-box setting, improving scalability and effectiveness.
Findings
GASP significantly outperforms baseline methods in jailbreak success rate.
It reduces training time and accelerates inference for adversarial prompt generation.
GASP produces more natural and human-readable prompts than previous approaches.
Abstract
LLMs have shown impressive capabilities across various natural language processing tasks, yet remain vulnerable to input prompts, known as jailbreak attacks, carefully designed to bypass safety guardrails and elicit harmful responses. Traditional methods rely on manual heuristics but suffer from limited generalizability. Despite being automatic, optimization-based attacks often produce unnatural prompts that can be easily detected by safety filters or require high computational costs due to discrete token optimization. In this paper, we introduce Generative Adversarial Suffix Prompter (GASP), a novel automated framework that can efficiently generate human-readable jailbreak prompts in a fully black-box setting. In particular, GASP leverages latent Bayesian optimization to craft adversarial suffixes by efficiently exploring continuous latent embedding spaces, gradually optimizing the…
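To make the phrase "Bayesian optimization over a continuous latent space" concrete, here is a minimal generic sketch. It is not the authors' implementation, and the truncated abstract does not specify their surrogate model or acquisition function. Everything here is an assumption chosen for illustration: a 2-D latent space, a Gaussian-process surrogate with an RBF kernel, the expected-improvement acquisition, and a hypothetical synthetic objective `toy_score` that stands in for any black-box score and has nothing to do with a language model.

```python
import math
import numpy as np

def rbf_kernel(a, b, length_scale=0.5):
    # Squared-exponential kernel between two sets of row vectors.
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / length_scale**2)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-4):
    # Posterior mean and standard deviation of a GP with constant mean.
    mu = y_obs.mean()
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_oq = rbf_kernel(x_obs, x_query)
    solve = np.linalg.solve(k_oo, k_oq)            # K^-1 k_*
    mean = mu + solve.T @ (y_obs - mu)
    var = 1.0 - np.sum(k_oq * solve, axis=0)       # RBF prior variance is 1
    return mean, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mean, std, best):
    # Expected improvement for maximization under a Gaussian posterior.
    z = (mean - best) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    return (mean - best) * cdf + std * pdf

def toy_score(z):
    # Hypothetical black-box objective (assumption): peaks at (0.3, -0.2).
    target = np.array([0.3, -0.2])
    return float(np.exp(-np.sum((z - target) ** 2)))

def bayes_opt(score_fn, dim=2, n_init=5, n_iter=20, n_candidates=500, seed=0):
    rng = np.random.default_rng(seed)
    # Start from a few random latent points and their observed scores.
    x_obs = rng.uniform(-1.0, 1.0, size=(n_init, dim))
    y_obs = np.array([score_fn(x) for x in x_obs])
    for _ in range(n_iter):
        # Pick the random candidate with the highest expected improvement.
        candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, dim))
        mean, std = gp_posterior(x_obs, y_obs, candidates)
        ei = expected_improvement(mean, std, y_obs.max())
        x_next = candidates[np.argmax(ei)]
        # Query the black box once and add the result to the observations.
        x_obs = np.vstack([x_obs, x_next])
        y_obs = np.append(y_obs, score_fn(x_next))
    best = int(np.argmax(y_obs))
    return x_obs[best], y_obs[best], y_obs
```

Calling `bayes_opt(toy_score)` returns the best latent point found, its score, and the full score history. The loop only ever calls `score_fn` on chosen points, which is what makes the procedure black-box: no gradients or internals of the scored system are needed.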
Taxonomy
Topics: Handwritten Text Recognition Techniques · Vehicle License Plate Recognition · Digital Media Forensic Detection
