REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak

Jiachen Ma; Jiawen Zhang; Xiangtian Li; Bo Zou; Chaochao Lu; Chao Yang

arXiv:2605.20654 · cs.LG · May 21, 2026


TL;DR

Reflector is a two-stage framework that internalizes self-reflection in LLMs to effectively defend against complex jailbreak attacks and improve task performance.

Contribution

Reflector introduces a two-stage training process: supervised fine-tuning on teacher-guided reflection data, followed by reinforcement learning, so that LLMs internalize self-reflection for both safety and utility.

Findings

01

Achieves Defense Success Rates (DSR) exceeding 90% against complex indirect jailbreak attacks.

02

Improves GSM8K performance by 5.85%.

03

Enhances robustness across diverse threat scenarios.

Abstract

While Large Language Models (LLMs) demonstrate remarkable capabilities, they remain susceptible to sophisticated, multi-step jailbreak attacks that circumvent conventional surface-level safety alignment by exploiting the internal generation process. To address these vulnerabilities, we propose Reflector, a principled two-stage framework that internalizes self-reflection within the generation trajectory. Reflector first leverages teacher-guided generation to produce high-quality reflection data for supervised fine-tuning (SFT), establishing structured reflection patterns. It subsequently uses Reinforcement Learning (RL) with outcome-driven and reward-validity supervision to instill robust, autonomous self-reflection capabilities. Empirical results show that Reflector achieves Defense Success Rates (DSR) exceeding 90% against complex indirect attacks while generalizing robustly across…
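The abstract reports results as a Defense Success Rate (DSR) but this excerpt does not define the metric. As a minimal sketch, assuming DSR is simply the percentage of attack attempts that the model defends against, it could be computed as follows. The function name `defense_success_rate` and the sample judgments are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def defense_success_rate(defended: Sequence[bool]) -> float:
    """Return the percentage of attack attempts that were defended.

    Assumption: DSR = 100 * (defended attempts) / (total attempts).
    The paper may define the metric differently.
    """
    if not defended:
        raise ValueError("need at least one attack attempt")
    # sum() counts the True entries, i.e. the defended attempts.
    return 100.0 * sum(defended) / len(defended)


# Hypothetical judgments for 10 attack prompts (True = attack was defended).
judgments = [True] * 9 + [False]
dsr = defense_success_rate(judgments)
print(f"DSR = {dsr:.1f}%")
```

Under this assumed definition, a DSR above 90% means that fewer than one in ten attack attempts succeeds.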


Code & Models

Models

🤗 krystal7/llama-8b-reflect-sft (model, 16 downloads)
