Activation-Guided Local Editing for Jailbreaking Attacks

Jiecong Wang; Haoran Li; Hao Peng; Ziqian Zeng; Zihao Wang; Haohua Du; Zhengtao Yu

arXiv:2508.00555 · cs.CR · April 16, 2026

TL;DR

This paper introduces AGILE, a two-stage activation-guided local editing framework for jailbreaking language models, achieving high attack success rates and strong black-box transferability while remaining effective against prominent defenses.

Contribution

The paper presents a novel two-stage method that combines scenario-based rephrasing with hidden-state-guided edits to improve jailbreak effectiveness.

Findings

01. Achieves an attack success rate up to 37.74% higher than baselines.

02. Demonstrates strong transferability to black-box models.

03. Remains effective against prominent defense mechanisms.

Abstract

Jailbreaking is an essential adversarial technique for red-teaming language models to uncover and patch security flaws. However, existing jailbreak methods face significant drawbacks. Token-level jailbreak attacks often produce incoherent or unreadable inputs and exhibit poor transferability, while prompt-level attacks lack scalability and rely heavily on manual effort and human ingenuity. We propose a concise and effective two-stage framework that combines the advantages of these approaches. The first stage generates a scenario-based context and rephrases the original malicious query to obscure its harmful intent. The second stage then uses information from the model's hidden states to guide fine-grained edits, steering the model's internal representation of the input from a malicious one toward a benign one. Extensive experiments demonstrate that this method…
