AutoRISE: Agent-Driven Strategy Evolution for Red-Teaming Large Language Models
Tanmay Gautam, Alireza Bahramali, Sandeep Atluri

TL;DR
AutoRISE red-teams large language models by optimizing the attack strategy itself, expressed as an executable program, rather than individual prompts. It improves attack success rates without requiring fine-tuning or human annotation.
Contribution
AutoRISE searches over executable attack programs rather than prompts. This permits structural strategy changes, such as new attack components and altered control flow, and it outperforms prompt-level methods.
Findings
AutoRISE improves attack success rate by 17 points over baselines.
It achieves an improvement of up to 16 points on frontier targets.
The method operates without fine-tuning or GPU compute.
Abstract
Automated red-teaming methods for large language models typically optimize attack prompts within a fixed, human-designed strategy, leaving the attack strategy itself unchanged. We instead optimize the strategy. We propose AutoRISE, a method that searches over executable attack programs rather than individual prompts. At each iteration, a coding agent edits a strategy and a fixed evaluation harness scores the resulting attacks, returning both a scalar objective and per-example diagnostics that guide subsequent edits. This allows structural changes, including new attack components and altered control flow, that prompt-level methods do not directly express. We also release two benchmark suites developed on disjoint target sets and evaluate on 11 models from five families against seven established jailbreak datasets. Across held-out models, AutoRISE improves average attack success rate by…
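The edit-and-score loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `evaluate`, `search`, `Diagnostic`, `toy_agent`, and `toy_judge` are hypothetical, the greedy keep-if-better acceptance rule is an assumption (the abstract does not say how candidates are selected), and the "agent" and "judge" are harmless deterministic stand-ins for the coding agent and the evaluation harness.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumption: a "strategy" is an executable program, modeled here as a
# function from a test case to a candidate output.
Strategy = Callable[[str], str]
Judge = Callable[[str, str], bool]


@dataclass
class Diagnostic:
    """Per-example feedback returned by the harness (hypothetical shape)."""
    case: str
    output: str
    success: bool


def evaluate(strategy: Strategy, cases: List[str], judge: Judge) -> Tuple[float, List[Diagnostic]]:
    """Fixed harness: returns a scalar objective plus per-example diagnostics."""
    diagnostics = []
    for case in cases:
        output = strategy(case)
        diagnostics.append(Diagnostic(case, output, judge(case, output)))
    score = sum(d.success for d in diagnostics) / len(cases)  # success rate
    return score, diagnostics


def search(initial: Strategy,
           propose_edit: Callable[[Strategy, List[Diagnostic]], Strategy],
           cases: List[str], judge: Judge, iterations: int) -> Tuple[Strategy, float]:
    """Iteratively edit the strategy; keep an edit only if the score improves
    (greedy acceptance is an assumption of this sketch)."""
    best = initial
    best_score, diags = evaluate(best, cases, judge)
    for _ in range(iterations):
        candidate = propose_edit(best, diags)
        score, cand_diags = evaluate(candidate, cases, judge)
        if score > best_score:
            best, best_score, diags = candidate, score, cand_diags
    return best, best_score


# Toy stand-ins so the loop runs end to end.
def toy_judge(case: str, output: str) -> bool:
    return len(output) >= 3  # arbitrary, harmless success criterion


def toy_agent(current: Strategy, diags: List[Diagnostic]) -> Strategy:
    # Uses the diagnostics: only edit the program if some example still fails.
    if all(d.success for d in diags):
        return current
    return lambda case: current(case) + "x"  # wrap the program with one more step


cases = ["a", "bb", "ccc"]
best, best_score = search(lambda case: case, toy_agent, cases, toy_judge, iterations=3)
print(best_score)
```

The point of the sketch is the division of labor: the harness and its scoring stay fixed, while the object being edited is the program that produces outputs, so an edit can change structure rather than only wording.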
