Recursive language models for jailbreak detection: a procedural defense for tool-augmented agents
Doron Shavit

TL;DR
This paper introduces RLM-JB, a jailbreak detection framework built on recursive language models. It replaces single-pass classification with a procedural, multi-step analysis to improve robustness against sophisticated evasion techniques.
Contribution
It presents a recursive, procedure-oriented detection framework that normalizes the input, splits it into chunks, and aggregates per-chunk evidence to identify jailbreak prompts more effectively.
Findings
Achieves 92.5-98.0% detection accuracy on adversarial inputs.
Maintains high precision of 98.99-100% with low false positives.
Effective across multiple LLM backends.
Abstract
Jailbreak prompts are a practical and evolving threat to large language models (LLMs), particularly in agentic systems that execute tools over untrusted content. Many attacks exploit long-context hiding, semantic camouflage, and lightweight obfuscations that can evade single-pass guardrails. We present RLM-JB, an end-to-end jailbreak detection framework built on Recursive Language Models (RLMs), in which a root model orchestrates a bounded analysis program that transforms the input, queries worker models over covered segments, and aggregates evidence into an auditable decision. RLM-JB treats detection as a procedure rather than a one-shot classification: it normalizes and de-obfuscates suspicious inputs, chunks text to reduce context dilution and guarantee coverage, performs parallel chunk screening, and composes cross-chunk signals to recover split-payload attacks. On AutoDAN-style…
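The procedure described in the abstract (normalize, chunk, screen chunks in parallel, compose cross-chunk signals) can be sketched in Python as below. This is a minimal illustration, not the paper's implementation: the names `normalize`, `chunk`, `screen_chunk`, `detect`, and `CUES`, the chunk sizes, and the scoring weights are all assumptions, and a regex heuristic stands in for the worker model calls.

```python
import base64
import binascii
import re
import unicodedata
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Zero-width characters mapped to None so str.translate deletes them.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))

# Hypothetical cue patterns; a real system would query a worker model instead.
CUES = {
    "override": r"ignore (all )?(previous|prior) instructions",
    "persona": r"you are now|pretend to be",
    "exfiltrate": r"reveal (the|your) system prompt",
}

@dataclass
class Finding:
    index: int        # position of the chunk in the input
    score: float      # per-chunk suspicion in [0, 1]
    cues: list[str]   # names of the cues that fired

def normalize(text: str) -> str:
    """Fold Unicode, drop zero-width characters, append decoded Base64 runs."""
    text = unicodedata.normalize("NFKC", text).translate(ZERO_WIDTH)
    decoded = []
    for run in re.findall(r"[A-Za-z0-9+/]{16,}={0,2}", text):
        try:
            decoded.append(base64.b64decode(run, validate=True).decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64 text; leave the run untouched
    return text + "".join(f"\n[decoded] {d}" for d in decoded)

def chunk(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    """Overlapping windows that together cover every character of the input."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def screen_chunk(index: int, text: str) -> Finding:
    """Stand-in for a worker model: score one chunk in isolation."""
    hits = [name for name, pat in CUES.items() if re.search(pat, text, re.I)]
    return Finding(index, min(1.0, 0.4 * len(hits)), hits)

def detect(prompt: str, threshold: float = 0.7) -> dict:
    """Root procedure: normalize, chunk, screen in parallel, aggregate."""
    chunks = chunk(normalize(prompt))
    with ThreadPoolExecutor(max_workers=4) as pool:
        findings = list(pool.map(screen_chunk, range(len(chunks)), chunks))
    # Cross-chunk composition: distinct cues spread over several chunks
    # count together, so a payload split across chunks is still caught.
    cues = sorted({c for f in findings for c in f.cues})
    top = max(f.score for f in findings)
    combined = min(1.0, 0.4 * len(cues))
    return {
        "flagged": max(top, combined) >= threshold,
        "cues": cues,
        "evidence": [f for f in findings if f.cues],  # audit trail
    }
```

In this sketch a single weak cue in one chunk stays below the threshold, while two cues located far apart in a long input are combined by the aggregation step. The returned `evidence` list records which chunks contributed, which is one simple way to make the decision auditable.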
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Topic Modeling
