PRISM: Robust VLM Alignment with Principled Reasoning for Integrated Safety in Multimodality

Nanxi Li; Zhengyue Zhao; G. Edward Suh; Marco Pavone; Chaowei Xiao

arXiv:2508.18649 · cs.CR · April 3, 2026

TL;DR

PRISM is a structured reasoning framework that enhances the safety and robustness of vision-language models against complex multimodal threats while maintaining utility.

Contribution

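PRISM combines a structured four-stage reasoning process with PRISM-DPO, which is generated via Monte Carlo Tree Search (MCTS) to refine reasoning quality through Direct Preference Optimization, and is reported to outperform existing methods on multimodal safety for VLMs.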

Findings

01. Reduces attack success rates on the JailbreakV-28K and VLBreak datasets.

02. Improves robustness against adaptive multimodal attacks.

03. Maintains utility on benign multimodal benchmarks.

Abstract

Safeguarding vision-language models (VLMs) is a critical challenge, as existing methods often suffer from over-defense, which harms utility, or rely on shallow alignment, failing to detect complex threats that require deep reasoning. To this end, we introduce PRISM (Principled Reasoning for Integrated Safety in Multimodality), a System 2-like framework that aligns VLMs through a structured four-stage reasoning process explicitly designed to handle three distinct categories of multimodal safety violations. Our framework consists of two key components: a structured reasoning pipeline that analyzes each violation category in dedicated stages, and PRISM-DPO, generated via Monte Carlo Tree Search (MCTS) to refine reasoning quality through Direct Preference Optimization. Comprehensive evaluations show that PRISM substantially reduces attack success rates on JailbreakV-28K and VLBreak, improves…
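
For readers unfamiliar with Direct Preference Optimization, the following is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. It is not taken from the PRISM repository: the function name `dpo_pair_loss`, its parameter names, and the default `beta=0.1` are illustrative assumptions. In practice the four log-probabilities would come from a trainable policy model and a frozen reference model, each scoring a preferred and a dispreferred response.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All names and the default beta are hypothetical, for illustration only.
    Inputs are summed token log-probabilities of each full response.
    """
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin; positive when the policy prefers "chosen".
    x = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(x)) = softplus(-x), computed in a stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the policy assigns the same log-probabilities as the reference, the margin is zero and the loss equals log 2. Raising the chosen response's log-probability relative to the reference, or lowering the rejected one's, pushes the loss below that value.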

Code & Models

Repositories

SaFoLab-WISC/PRISM (GitHub)