Towards Policy-Adaptive Image Guardrail: Benchmark and Method

Caiyong Piao; Zhiyuan Yan; Haoming Xu; Yunzhen Zhao; Kaiqing Lin; Feiyang Xu; Shuigeng Zhou

arXiv:2603.01228·cs.CV·April 1, 2026

TL;DR

This paper introduces SafeEditBench for benchmarking cross-policy generalization of vision-language models and proposes SafeGuard-VL, a reinforcement learning method for adaptable unsafe image guardrails.

Contribution

It presents a new evaluation suite for policy generalization and a reinforcement learning approach for dynamic safety policy adaptation in visual content filtering.

Findings

1. Existing VLMs overfit to seen policies and fail to generalize.
2. SafeGuard-VL improves policy adaptation and safety compliance.
3. SafeEditBench enables fine-grained assessment of policy-aware generalization.

Abstract

Accurate rejection of sensitive or harmful visual content, i.e., a harmful-image guardrail, is critical in many application scenarios. This task must continuously adapt to evolving safety policies and content across various domains and over time. However, traditional classifiers, confined to fixed categories, require frequent retraining when new policies are introduced. Vision-language models (VLMs) offer a more adaptable and generalizable foundation for dynamic safety guardrails. Despite this potential, existing VLM-based safeguarding methods are typically trained and evaluated under only a single fixed safety policy. We find that these models overfit heavily to the seen policy, fail to generalize to unseen policies, and even lose basic instruction-following ability and general knowledge. To address this issue, in this paper we make two key contributions. First, we benchmark the…

Code & Models

Models

tyodd/SafeGuard-VL-RL
model · 195 downloads

Datasets

tyodd/SafeEditBench
dataset · 51 downloads
