Towards Policy-Adaptive Image Guardrail: Benchmark and Method
Caiyong Piao, Zhiyuan Yan, Haoming Xu, Yunzhen Zhao, Kaiqing Lin, Feiyang Xu, Shuigeng Zhou

TL;DR
This paper introduces SafeEditBench for benchmarking cross-policy generalization of vision-language models and proposes SafeGuard-VL, a reinforcement learning method for adaptable unsafe image guardrails.
Contribution
It presents a new evaluation suite for policy generalization and a reinforcement learning approach for dynamic safety policy adaptation in visual content filtering.
Findings
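Existing VLM-based guardrails overfit to the policy seen during training, fail to generalize to unseen policies, and can lose basic instruction-following ability and general knowledge.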
SafeGuard-VL improves policy adaptation and safety compliance.
SafeEditBench enables fine-grained assessment of policy-aware generalization.
Abstract
Accurate rejection of sensitive or harmful visual content, i.e., a harmful image guardrail, is critical in many application scenarios. This task must continuously adapt to evolving safety policies and content across domains and over time. However, traditional classifiers, confined to fixed categories, require frequent retraining when new policies are introduced. Vision-language models (VLMs) offer a more adaptable and generalizable foundation for dynamic safety guardrails. Despite this potential, existing VLM-based safeguarding methods are typically trained and evaluated under a single fixed safety policy. We find that these models heavily overfit to the seen policy, fail to generalize to unseen policies, and even lose basic instruction-following ability and general knowledge. To address this issue, we make two key contributions in this paper. First, we benchmark the…
