Recursive language models for jailbreak detection: a procedural defense for tool-augmented agents

Doron Shavit

arXiv:2602.16520·cs.CR·February 19, 2026

Open Access

TL;DR

This paper introduces RLM-JB, a framework built on recursive language models for detecting jailbreak prompts aimed at large language models. It treats detection as a procedural, multi-step analysis rather than a single pass, which improves robustness against sophisticated evasion techniques.

Contribution

It presents a novel recursive, procedure-oriented detection framework that normalizes the input, splits it into chunks, screens each chunk, and aggregates the resulting evidence to identify jailbreak prompts that can evade single-pass guardrails.

Findings

01. Achieves 92.5-98.0% detection accuracy on adversarial inputs.

02. Maintains high precision of 98.99-100% with low false positives.

03. Effective across multiple LLM backends.

Abstract

Jailbreak prompts are a practical and evolving threat to large language models (LLMs), particularly in agentic systems that execute tools over untrusted content. Many attacks exploit long-context hiding, semantic camouflage, and lightweight obfuscations that can evade single-pass guardrails. We present RLM-JB, an end-to-end jailbreak detection framework built on Recursive Language Models (RLMs), in which a root model orchestrates a bounded analysis program that transforms the input, queries worker models over covered segments, and aggregates evidence into an auditable decision. RLM-JB treats detection as a procedure rather than a one-shot classification: it normalizes and de-obfuscates suspicious inputs, chunks text to reduce context dilution and guarantee coverage, performs parallel chunk screening, and composes cross-chunk signals to recover split-payload attacks. On AutoDAN-style…
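The procedure described in the abstract (normalize, de-obfuscate, chunk with guaranteed coverage, screen chunks in parallel, aggregate cross-chunk signals) can be illustrated with a minimal sketch. This is not the paper's implementation: every name here (`normalize`, `deobfuscate`, `chunk`, `screen_chunk`, `detect`), the `CUES` table, and the chunk sizes are hypothetical, and a simple keyword matcher stands in for the worker models that the paper queries.

```python
import base64
import binascii
import re
import unicodedata
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cue table; in RLM-JB the per-chunk judgment comes from a worker LLM.
CUES = {
    "override": ("ignore", "disregard", "forget"),
    "target": ("previous instructions", "system prompt", "safety rules"),
}
# Translation table that deletes common zero-width characters.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def normalize(text):
    """Apply NFKC folding, drop zero-width characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).translate(ZERO_WIDTH)
    return re.sub(r"\s+", " ", text).strip()


def deobfuscate(text):
    """Append the decoded form of any token that is valid base64 and valid UTF-8."""
    decoded = []
    for token in re.findall(r"[A-Za-z0-9+/]{16,}={0,2}", text):
        try:
            decoded.append(base64.b64decode(token, validate=True).decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not base64 after all; leave the token as it was
    return " ".join([text, *decoded])


def chunk(text, size=200, overlap=50):
    """Split into overlapping windows so every character lands in some chunk."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def screen_chunk(chunk_text):
    """Stand-in for one worker call: return the cue categories seen in this chunk."""
    lowered = chunk_text.casefold()
    return {name for name, words in CUES.items() if any(w in lowered for w in words)}


def detect(text, size=200, overlap=50):
    """Root procedure: transform, screen chunks in parallel, aggregate evidence."""
    chunks = chunk(deobfuscate(normalize(text)), size, overlap)
    with ThreadPoolExecutor(max_workers=4) as pool:
        per_chunk = list(pool.map(screen_chunk, chunks))
    seen = set().union(*per_chunk)
    return {
        # Flag only when all cue categories appear, possibly in different chunks.
        "flagged": seen == set(CUES),
        # Per-chunk evidence keeps the decision auditable.
        "evidence": [(i, sorted(c)) for i, c in enumerate(per_chunk) if c],
        "chunks": len(chunks),
    }
```

In this toy version, a payload whose two halves sit in different chunks is still flagged, because `detect` takes the union of the per-chunk signals before deciding. The overlap between windows also means any phrase no longer than `overlap` characters falls entirely inside at least one chunk.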

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Topic Modeling