SafeDream: Safety World Model for Proactive Early Jailbreak Detection
Bo Yan, Weikai Lin, Yada Zhu, Song Wang

TL;DR
SafeDream is a proactive framework for early detection of multi-turn jailbreak attacks on large language models. It detects attacks earlier than existing methods and does not modify model weights.
Contribution
The paper proposes SAFEDREAM, a lightweight external module that combines a safety state world model, CUSUM detection, and contrastive imagination for early jailbreak detection.
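To illustrate the CUSUM component named above, here is a minimal sketch of a standard one-sided CUSUM detector applied to per-turn risk scores. This is the textbook algorithm, not necessarily the paper's exact formulation: the function name `cusum_first_alarm`, the `drift` and `threshold` parameters, and the idea that each turn yields a scalar risk score are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def cusum_first_alarm(
    scores: Sequence[float],
    drift: float = 0.5,
    threshold: float = 0.7,
) -> Optional[int]:
    """Return the 1-indexed turn at which a one-sided CUSUM statistic
    first exceeds `threshold`, or None if it never does.

    `scores` are hypothetical per-turn risk scores (higher = riskier).
    `drift` is the allowance subtracted each turn, so that scores at or
    below it do not accumulate.
    """
    stat = 0.0
    for turn, score in enumerate(scores, start=1):
        # Accumulate evidence above the drift allowance; reset at zero.
        stat = max(0.0, stat + score - drift)
        if stat > threshold:
            return turn
    return None


# A conversation whose risk rises gradually versus a benign one.
escalating = [0.1, 0.2, 0.9, 0.9, 0.9]
benign = [0.1, 0.2, 0.1, 0.3]
print(cusum_first_alarm(escalating))
print(cusum_first_alarm(benign))
```

Because the statistic accumulates across turns, a run of moderately elevated scores can raise an alarm even when no single turn crosses the threshold on its own, which is the property that makes CUSUM a natural fit for modeling cumulative erosion.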
Findings
SAFEDREAM detects attacks 1.06-1.20 turns before compliance.
It maintains competitive false positive rates.
It outperforms baselines in detection timeliness.
Abstract
Multi-turn jailbreak attacks progressively erode LLM safety alignment across seemingly innocuous conversation turns, achieving success rates exceeding 90% against state-of-the-art models. Existing alignment-based and guardrail methods suffer from three key limitations: they require costly weight modification, evaluate each turn independently without modeling cumulative safety erosion, and detect attacks only after harmful content has been generated. To address these limitations, we first formulate the proactive early jailbreak detection problem with a new metric, detection lead, that measures how early an attack can be detected before the LLM complies. We then propose SAFEDREAM, a lightweight world-model-based framework that operates as an external module without modifying the LLM's weights. SAFEDREAM introduces three components: (1) a safety state world model that encodes LLM hidden…
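As a rough illustration of the detection lead metric described in the abstract, the sketch below computes it as the number of turns between the detector's first alarm and the turn at which the LLM complies. The exact definition in the paper is not given in this excerpt, so the subtraction, the function names, and the choice to skip undetected attacks when averaging are all assumptions.

```python
from typing import Iterable, Optional, Tuple


def detection_lead(detection_turn: Optional[int], compliance_turn: int) -> Optional[int]:
    """Turns between the first alarm and LLM compliance (assumed definition).

    Positive means the alarm came before compliance; None means the
    attack was never flagged.
    """
    if detection_turn is None:
        return None
    return compliance_turn - detection_turn


def mean_detection_lead(
    pairs: Iterable[Tuple[Optional[int], int]],
) -> Optional[float]:
    """Average lead over detected attacks only (an assumption; a real
    evaluation might penalize misses instead of skipping them)."""
    leads = [
        lead
        for det, comp in pairs
        if (lead := detection_lead(det, comp)) is not None
    ]
    return sum(leads) / len(leads) if leads else None


# Hypothetical (detection_turn, compliance_turn) pairs for three attacks.
attacks = [(3, 5), (4, 5), (None, 6)]
print(mean_detection_lead(attacks))
```

Under this reading, the reported lead of 1.06-1.20 turns would mean the alarm fires slightly more than one turn before compliance.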
