SafeDream: Safety World Model for Proactive Early Jailbreak Detection

Bo Yan; Weikai Lin; Yada Zhu; Song Wang

arXiv:2604.16824·cs.CR·April 21, 2026

TL;DR

SafeDream introduces a proactive framework for early detection of multi-turn jailbreak attacks on large language models, outperforming existing methods in timeliness without modifying model weights.

Contribution

The paper proposes SAFEDREAM, a lightweight external module that combines a safety state world model, CUSUM-based detection, and contrastive imagination to detect jailbreaks early.
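To make the CUSUM component concrete, here is a minimal sketch of the textbook one-sided CUSUM rule applied to a per-turn risk score. It is not the paper's implementation: the function name `cusum_first_alarm`, the `drift` and `threshold` parameters, and the example scores are all assumptions made for illustration. The point it shows is that evidence accumulated across turns can trigger an alarm even when no single turn looks alarming on its own.

```python
from typing import Optional, Sequence


def cusum_first_alarm(
    scores: Sequence[float], drift: float, threshold: float
) -> Optional[int]:
    """Return the 1-indexed turn of the first CUSUM alarm, or None.

    Standard one-sided CUSUM: S_t = max(0, S_{t-1} + x_t - drift),
    with an alarm raised the first time S_t exceeds `threshold`.
    All parameter names here are hypothetical, not taken from the paper.
    """
    stat = 0.0
    for turn, score in enumerate(scores, start=1):
        # Accumulate only the part of the score above the allowed drift;
        # clamp at zero so benign turns cannot build up negative credit.
        stat = max(0.0, stat + score - drift)
        if stat > threshold:
            return turn
    return None


# Hypothetical per-turn risk scores for a slowly escalating conversation.
# No single score exceeds 0.5, yet the cumulative statistic crosses 0.3.
risk = [0.1, 0.2, 0.4, 0.45, 0.45, 0.45]
alarm_turn = cusum_first_alarm(risk, drift=0.3, threshold=0.3)
print(alarm_turn)
```

In this sketch `drift` sets how much per-turn risk is tolerated as normal, and `threshold` trades earlier alarms against more false positives.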

Findings

01  SAFEDREAM detects attacks 1.06 to 1.20 turns before the LLM complies.

02  It maintains competitive false positive rates.

03  It outperforms baselines in detection timeliness.

Abstract

Multi-turn jailbreak attacks progressively erode LLM safety alignment across seemingly innocuous conversation turns, achieving success rates exceeding 90% against state-of-the-art models. Existing alignment-based and guardrail methods suffer from three key limitations: they require costly weight modification, evaluate each turn independently without modeling cumulative safety erosion, and detect attacks only after harmful content has been generated. To address these limitations, we first formulate the proactive early jailbreak detection problem with a new metric, detection lead, that measures how early an attack can be detected before the LLM complies. We then propose SAFEDREAM, a lightweight world-model-based framework that operates as an external module without modifying the LLM's weights. SAFEDREAM introduces three components: (1) a safety state world model that encodes LLM hidden…
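The detection lead metric described above can be sketched as follows. The excerpt does not give the paper's exact formula, so this assumes the simplest reading: the lead is the turn at which the LLM first complies minus the turn at which the detector first alarms, averaged over conversations. The names `detection_lead` and `mean_detection_lead`, and the example turn numbers, are hypothetical.

```python
from typing import List, Optional, Tuple


def detection_lead(detection_turn: Optional[int], compliance_turn: int) -> Optional[int]:
    """Turns between the first alarm and the first compliant response.

    Assumed definition (not confirmed by the source): positive means the
    attack was flagged before the LLM complied, zero means it was flagged
    on the same turn, and None means the detector never fired.
    """
    if detection_turn is None:
        return None
    return compliance_turn - detection_turn


def mean_detection_lead(pairs: List[Tuple[Optional[int], int]]) -> Optional[float]:
    """Average lead over conversations in which the detector fired."""
    leads = [
        lead
        for det, comp in pairs
        if (lead := detection_lead(det, comp)) is not None
    ]
    return sum(leads) / len(leads) if leads else None


# Hypothetical (detection_turn, compliance_turn) pairs for four attacks;
# the last attack was never detected and is excluded from the mean.
examples = [(3, 4), (2, 4), (5, 5), (None, 6)]
average_lead = mean_detection_lead(examples)
print(average_lead)
```

Excluding undetected attacks from the mean is a design choice of this sketch; a stricter variant could instead report them separately as misses.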
