ASTRA: An Automated Framework for Strategy Discovery, Retrieval, and Evolution for Jailbreaking LLMs
Xu Liu, Yan Chen, Kan Ling, Yichi Zhu, Hengrun Zhang, Guisheng Fan, Huiqun Yu

TL;DR
ASTRA is an automated framework that continuously discovers, evolves, and manages attack strategies against LLMs, improving jailbreak effectiveness through self-learning and hierarchical strategy management.
Contribution
It introduces a novel closed-loop mechanism and a dynamic strategy library enabling autonomous strategy discovery and evolution for LLM jailbreaks.
Findings
ASTRA outperforms existing methods in black-box attack scenarios.
The hierarchical strategy library improves attack efficiency and success rate.
ASTRA's self-evolving approach enhances adaptability of attack strategies.
Abstract
Despite extensive safety alignment, Large Language Models (LLMs) remain vulnerable to jailbreak attacks. However, existing methods generally lack the capability for continuous learning and self-evolution from interactions, limiting the diversity and adaptability of attack strategies. To address this, we propose ASTRA, an automated framework capable of autonomously discovering, retrieving, and evolving attack strategies. ASTRA operates on a closed-loop "attack-evaluate-distill-reuse" mechanism, which not only generates attack prompts but also automatically distills reusable strategies from every interaction. To systematically manage these strategies, we introduce a dynamic three-tier strategy library (Effective, Promising, and Ineffective) that categorizes strategies based on performance. This hierarchical memory mechanism enables the framework to enhance efficiency by leveraging…
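The three-tier library described in the abstract can be pictured as a bookkeeping structure that re-buckets each entry as new evaluation scores arrive. The sketch below is illustrative only and is not the authors' implementation: the class names, the use of a running mean, and the thresholds 0.7 and 0.3 are all assumptions, and the entries are opaque placeholder labels.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    EFFECTIVE = "effective"
    PROMISING = "promising"
    INEFFECTIVE = "ineffective"


@dataclass
class StrategyRecord:
    """Hypothetical record: a label plus the evaluation scores seen so far."""
    name: str
    scores: list = field(default_factory=list)  # assumed to lie in [0, 1]

    @property
    def mean_score(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0


class TieredLibrary:
    """Buckets records into three tiers by mean score (thresholds are assumed)."""

    def __init__(self, effective_min: float = 0.7, promising_min: float = 0.3):
        self.effective_min = effective_min
        self.promising_min = promising_min
        self.records = {}

    def record(self, name: str, score: float) -> None:
        # Create the record on first sight, then append the new score.
        self.records.setdefault(name, StrategyRecord(name)).scores.append(score)

    def tier_of(self, name: str) -> Tier:
        # The tier is recomputed on every call, so it tracks the latest scores.
        mean = self.records[name].mean_score
        if mean >= self.effective_min:
            return Tier.EFFECTIVE
        if mean >= self.promising_min:
            return Tier.PROMISING
        return Tier.INEFFECTIVE

    def by_tier(self, tier: Tier) -> list:
        return sorted(n for n in self.records if self.tier_of(n) == tier)


lib = TieredLibrary()
lib.record("strategy_a", 0.9)
lib.record("strategy_b", 0.5)
lib.record("strategy_c", 0.1)
```

Because tiers are derived from the stored scores rather than assigned once, an entry moves between tiers as further evaluations are recorded, which is one simple way to read "dynamic" in the abstract.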
