TL;DR
This paper introduces AGILE, a two-stage activation-guided local editing framework for jailbreaking language models, achieving high success rates and transferability while resisting defenses.
Contribution
It presents a novel two-stage method that combines scenario-based rephrasing with hidden-state-guided edits to improve jailbreak effectiveness.
Findings
Achieves an attack success rate up to 37.74% higher than baselines.
Demonstrates strong transferability to black-box models.
Remains effective against prominent defense mechanisms.
Abstract
Jailbreaking is an essential adversarial technique for red-teaming language models to uncover and patch security flaws. However, existing jailbreak methods face significant drawbacks. Token-level jailbreak attacks often produce incoherent or unreadable inputs and exhibit poor transferability, while prompt-level attacks lack scalability and rely heavily on manual effort and human ingenuity. We propose a concise and effective two-stage framework that combines the advantages of these approaches. The first stage performs a scenario-based generation of context and rephrases the original malicious query to obscure its harmful intent. The second stage then uses information from the model's hidden states to guide fine-grained edits, steering the model's internal representation of the input from a malicious one toward a benign one. Extensive experiments demonstrate that this method…
