AJAR: Adaptive Jailbreak Architecture for Red-teaming
Yipu Dou, Wang Yang

TL;DR
AJAR is a flexible framework that enhances the evaluation of LLM safety by enabling multi-turn jailbreak algorithms to be orchestrated as callable services within a tool-aware runtime, improving attack success rates and realism.
Contribution
The paper introduces AJAR, a framework that exposes multi-turn jailbreak algorithms as callable services, enabling more realistic and effective red-teaming of LLMs under agent-like conditions.
Findings
AJAR improves attack success rates on HarmBench behaviors.
AJAR achieves earlier success in multi-turn attack scenarios.
AJAR reproduces Crescendo more effectively than PyRIT.
Abstract
Large language model (LLM) safety evaluation is moving from content moderation to action security as modern systems gain persistent state, tool access, and autonomous control loops. Existing jailbreak frameworks still leave a gap between adaptive multi-turn attacks and agentic runtimes: attack algorithms are usually packaged as monolithic scripts, while agent harnesses rarely expose explicit abstractions for rollback, tool simulation, or strategy switching. We present AJAR, a red-teaming framework that exposes multi-turn jailbreak algorithms as callable MCP services and lets an Auditor Agent orchestrate them inside a tool-aware runtime built on Petri. AJAR integrates three representative attacks, namely Crescendo, ActorAttack, and X-Teaming, under a shared service interface for planning, prompt generation, optimization, evaluation, and context control. On 200 HarmBench validation…
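The abstract names rollback as one of the context-control abstractions that agent harnesses rarely expose. As a minimal sketch of what such an abstraction could look like, the snippet below keeps a multi-turn transcript and lets a caller return to an earlier checkpoint. The class `ConversationContext` and its methods are hypothetical names chosen for illustration; they are not taken from AJAR, Petri, or MCP, and the paper's actual interface may differ.

```python
from dataclasses import dataclass, field


@dataclass
class ConversationContext:
    """Hypothetical multi-turn transcript with checkpoint/rollback support."""

    turns: list[tuple[str, str]] = field(default_factory=list)
    _checkpoints: list[int] = field(default_factory=list)

    def add_turn(self, role: str, text: str) -> None:
        """Append one (role, text) turn to the transcript."""
        self.turns.append((role, text))

    def checkpoint(self) -> int:
        """Record the current transcript length and return a checkpoint id."""
        self._checkpoints.append(len(self.turns))
        return len(self._checkpoints) - 1

    def rollback(self, checkpoint_id: int) -> None:
        """Discard every turn added after the given checkpoint."""
        length = self._checkpoints[checkpoint_id]
        del self.turns[length:]
        # Checkpoints taken after this one refer to discarded turns, so drop them.
        del self._checkpoints[checkpoint_id + 1:]


# Usage: explore a branch, then return to the saved state.
ctx = ConversationContext()
ctx.add_turn("auditor", "opening message")
saved = ctx.checkpoint()
ctx.add_turn("target", "reply on a branch that is later abandoned")
ctx.rollback(saved)
```

Storing only transcript lengths keeps checkpoints cheap, on the assumption that the transcript is append-only between rollbacks.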
Taxonomy
Topics: Advanced Malware Detection Techniques · Adversarial Robustness in Machine Learning · Spam and Phishing Detection
