TL;DR
SafeVision is a novel, adaptable, and explainable image guardrail system that dynamically aligns with safety policies, outperforming existing models in accuracy and speed, and introduces a new dataset for harmful content detection.
Contribution
The paper introduces SafeVision, a dynamic, policy-adherent image guardrail with explainability, and VisionHarm, a dataset for comprehensive harmful-content benchmarking.
Findings
SafeVision outperforms GPT-4o by 8.6% on VisionHarm-T.
SafeVision is over 16 times faster than baseline models.
SafeVision achieves state-of-the-art performance on multiple benchmarks.
Abstract
With the rapid proliferation of digital media, the need for efficient and transparent safeguards against unsafe content is more critical than ever. Traditional image guardrail models, constrained by predefined categories, often misclassify content because they rely on purely feature-based learning without semantic reasoning. Moreover, these models struggle to adapt to emerging threats and require costly retraining to cover them. To address these limitations, we introduce SafeVision, a novel image guardrail that integrates human-like reasoning to enhance adaptability and transparency. Our approach incorporates an effective data collection and generation framework, a policy-following training pipeline, and a customized loss function. We also propose a diverse QA generation and training strategy to enhance learning effectiveness. SafeVision dynamically aligns with evolving safety policies at…
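To make the idea of a policy-following, explainable guardrail concrete, here is a minimal sketch of one way a policy could be supplied as text at inference time and a structured verdict checked against it. Everything in it is an assumption for illustration: the category names, the prompt wording, the JSON answer format, and the helper names `build_prompt` and `parse_verdict` are hypothetical and are not taken from SafeVision or VisionHarm. The model call itself is left out.

```python
import json

# Hypothetical policy: category names and definitions are illustrative only,
# not the categories used by SafeVision or VisionHarm.
POLICY = {
    "weapons": "Images depicting firearms or other weapons.",
    "self_harm": "Images depicting or encouraging self-injury.",
}


def build_prompt(policy: dict[str, str]) -> str:
    """Render the policy as plain text so it can change without retraining."""
    lines = ["Classify the image under the following policy."]
    for name, definition in policy.items():
        lines.append(f"- {name}: {definition}")
    # Assumed answer format: a category plus a short explanation.
    lines.append(
        'Answer as JSON: {"category": <name or "safe">, "explanation": <text>}'
    )
    return "\n".join(lines)


def parse_verdict(raw: str, policy: dict[str, str]) -> dict[str, str]:
    """Parse the model's JSON answer and reject categories outside the policy."""
    verdict = json.loads(raw)
    category = verdict.get("category", "")
    if category != "safe" and category not in policy:
        raise ValueError(f"category {category!r} is not in the active policy")
    return {"category": category, "explanation": verdict.get("explanation", "")}
```

Under these assumptions, adding a category is a change to the `POLICY` dictionary rather than to model weights, and the `explanation` field carries the rationale that a reviewer could inspect.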
Peer Reviews
Decision·Submitted to ICLR 2026
1. The paper is well structured and easy to follow.
2. The idea of building image guardrail models is important to the community.
- The motivation for introducing VISIONHARM is vague to me. What is the difference between your proposed benchmark and the safety datasets in the VLM domain (e.g., [1], since they also contain many unsafe images)?
- I am concerned about the performance of baselines such as GPT-4o and Llama-guard; they perform poorly on the Weapon dataset. Have you tried tuning their prompts (e.g., by giving more information about what is unsafe in the context)? Then, how do you explain their poor performan
1. This paper introduces a newly curated dataset that could support future safety research.
2. The authors explicitly consider computational efficiency, which is a crucial factor when deploying safety guardrails.
My concerns fall into two main categories — experimental soundness and technical novelty. 1. SAFEVISION is compared to safety baselines (e.g., Q16, Multi-Headed, LLaVAGuard) that could also be fine-tuned on the proposed dataset, but appear not to be. In addition, SAFEVISION’s refinement process uses powerful external models (e.g., GPT-4o, Qwen-VL, InternVL) unavailable to baselines, making results not directly comparable. The authors are encouraged to retrain baselines under comparable conditio
- New dataset contribution.
- New image guardrail.
- Strong performance.
- The Related Work section (Section 2) lacks sufficient discussion of prior studies, especially detailed comparisons with state-of-the-art works [a,b]. For instance, I found two recent papers whose dataset construction and image guardrail designs are highly similar to the authors’ contributions. From the data collection perspective, both works incorporate multiple policies. LlamaGuard also claims to support flexible policy configurations, and UnsafeBench collected a large number of unsafe images
