TL;DR
SRTJ is a training-free, self-evolving framework for systematically discovering, composing, and refining jailbreak strategies against LLMs. It uses attack feedback and rule organization to improve attack robustness and transferability.
Contribution
The paper presents a training-free approach that combines experience-driven attack generation with ASP-based rule selection and a hierarchical rule memory for effective jailbreaks.
Findings
Achieves strong attack performance across different LLMs.
Demonstrates improved robustness and generalization over existing methods.
Utilizes a hierarchical rule memory for effective strategy organization.
Abstract
LLMs are increasingly equipped with safety alignment mechanisms, yet recent studies demonstrate that they remain vulnerable to jailbreaking attacks that elicit harmful behaviors without explicit policy violations. While a growing body of work has explored automated jailbreak strategies, existing methods face several fundamental challenges, including the lack of systematic utilization of both successful and failed attack experiences, as well as the absence of principled mechanisms for composing and selecting reusable attack rules under diverse constraints. As a result, existing methods struggle to accumulate transferable knowledge over time and to reliably adapt attack strategies across different targets and evolving safety mechanisms. To address these issues, we propose a Self-Evolving Rule-Driven Training-Free Jailbreak (SRTJ) framework that systematically discovers, composes, and…
