Think-Reflect-Revise: A Policy-Guided Reflective Framework for Safety Alignment in Large Vision Language Models
Fenghua Weng, Chaochao Lu, Xia Hu, Wenqi Shao, Wenjie Wang

TL;DR
This paper introduces a three-stage training framework called Think-Reflect-Revise (TRR) that significantly improves the safety alignment of Large Vision Language Models by leveraging self-reflection and policy guidance to prevent unsafe outputs.
Contribution
The paper proposes a novel three-stage training framework, TRR, which uses a new dataset and reinforcement learning to enhance safety in LVLMs through self-reflection and correction.
Findings
The safe response rate increased from 42.8% to 87.7%.
TRR maintains performance on general benchmarks.
Reflective training improves safety awareness and robustness.
Abstract
As multimodal reasoning improves the overall capabilities of Large Vision Language Models (LVLMs), recent studies have begun to explore safety-oriented reasoning, aiming to enhance safety awareness by analyzing potential safety risks during the reasoning process before generating the final response. Although such approaches improve safety awareness and interpretability, this single-pass think-then-answer paradigm remains vulnerable to contextual or visual jailbreak attacks. This reveals a critical flaw: single-pass reasoning may overlook explicit harmful content in its own output. Our key insight is to exploit this wasted signal through reflection, which can effectively leverage the malicious content revealed in the first-pass reasoning to enable genuine self-correction and prevent unsafe generations. Motivated by this, we propose Think-Reflect-Revise (TRR), a three-stage training…
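The think, reflect, revise flow described in the abstract can be illustrated with a minimal control-flow sketch. This is not the authors' implementation: TRR is a training framework in which the model itself produces the reflection, whereas here a toy keyword check stands in for the policy and a canned function stands in for the LVLM. The names `UNSAFE_TERMS`, `violates_policy`, `think_reflect_revise`, and `toy_model` are all hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical policy: a toy keyword list standing in for a real safety policy.
UNSAFE_TERMS = ("build a weapon", "bypass the lock")


@dataclass
class Trace:
    think: str    # first-pass reasoning and draft answer
    reflect: str  # policy-guided critique of the first pass
    revise: str   # final response


def violates_policy(text: str) -> bool:
    """Toy policy check: flag text containing any listed term."""
    lowered = text.lower()
    return any(term in lowered for term in UNSAFE_TERMS)


def think_reflect_revise(prompt: str, generate: Callable[[str], str]) -> Trace:
    # Stage 1 (think): single-pass reasoning, which may leak harmful content.
    draft = generate(prompt)
    # Stage 2 (reflect): inspect the draft itself, not only the prompt.
    if violates_policy(draft):
        reflection = "Draft contains content that violates the policy."
        # Stage 3 (revise): replace the unsafe draft with a refusal.
        final = "I can't help with that request."
    else:
        reflection = "Draft is consistent with the policy."
        final = draft
    return Trace(think=draft, reflect=reflection, revise=final)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LVLM; returns a canned draft."""
    if "lock" in prompt.lower():
        return "Sure, here is how to bypass the lock step by step."
    return "The image shows a cat sitting on a sofa."
```

Calling `think_reflect_revise("How do I open this lock in the picture?", toy_model)` yields a trace whose draft is flagged and whose final response is the refusal, while a benign prompt passes the draft through unchanged. The point of the sketch is only that the second pass reads the first pass's own output, which is the signal a single-pass think-then-answer pipeline discards.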
Taxonomy
Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Topic Modeling
