AdvPrefix: An Objective for Nuanced LLM Jailbreaks

Sicheng Zhu; Brandon Amos; Yuandong Tian; Chuan Guo; Ivan Evtimov

arXiv:2412.10321 · cs.LG · December 30, 2025

TL;DR

AdvPrefix is a plug-and-play prefix-forcing objective for jailbreak attacks on large language models. It selects one or more model-dependent prefixes that combine a high prefilling attack success rate with a low negative log-likelihood, which makes attacks both more effective and more flexible.

Contribution

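The paper introduces a plug-and-play prefix-forcing objective that improves jailbreak success and control. It addresses two limitations of the common "Sure, here is (harmful request)" objective: limited control over model behavior and a rigid format that hinders optimization.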

Findings

01 Replacing GCG's default prefixes with AdvPrefix raises nuanced attack success rates on Llama-3 from 14% to 80%.

02 Integrates into existing jailbreak attacks as a plug-and-play objective.

03 Shows that current safety alignment fails to generalize to new prefixes.

Abstract

Many jailbreak attacks on large language models (LLMs) rely on a common objective: making the model respond with the prefix "Sure, here is (harmful request)". While straightforward, this objective has two limitations: limited control over model behaviors, yielding incomplete or unrealistic jailbroken responses, and a rigid format that hinders optimization. We introduce AdvPrefix, a plug-and-play prefix-forcing objective that selects one or more model-dependent prefixes by combining two criteria: high prefilling attack success rates and low negative log-likelihood. AdvPrefix integrates seamlessly into existing jailbreak attacks to mitigate the previous limitations for free. For example, replacing GCG's default prefixes on Llama-3 improves nuanced attack success rates from 14% to 80%, revealing that current safety alignment fails to generalize to new prefixes. Code and selected prefixes…

Code & Models

Repositories

facebookresearch/jailbreak-objectives (PyTorch, official)

Taxonomy

Topics: Digital and Cyber Forensics