HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model

Youngwan Lee; Kangsan Kim; Kwanyong Park; Ilchae Jung; Soojin Jang; Seanie Lee; Yong-Ju Lee; Sung Ju Hwang

arXiv:2506.04704·cs.CV·November 26, 2025

TL;DR

HoliSafe introduces a comprehensive safety benchmark and a modular visual guard to improve the safety and interpretability of vision-language models, addressing current limitations in safety evaluation and architectural robustness.

Contribution

The paper presents HoliSafe, a holistic safety dataset and benchmark, along with a modular visual guard module that enhances safety and interpretability of VLMs.

Findings

1. HoliSafe-Bench exposes vulnerabilities in existing VLMs.
2. Safe-VLM with VGM achieves state-of-the-art safety performance.
3. The visual guard module provides interpretable harmfulness classification.

Abstract

Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely primarily on data-centric tuning, with limited architectural innovations to intrinsically strengthen safety. We address these gaps by introducing a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations, providing a more robust basis for both training and evaluation (HoliSafe-Bench). We further propose a novel modular framework for enhancing VLM safety with a visual guard module (VGM)…
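The abstract counts five safe/unsafe image-text combinations but the excerpt here does not list them. The sketch below shows one plausible way the count arises, under two assumptions that are not stated in the text above: any pair containing an unsafe image or unsafe text is treated as unsafe overall, and a pair of individually safe inputs can be either safe or contextually unsafe. The names `SAFE`, `UNSAFE`, and `combination_types` are hypothetical and used only for illustration.

```python
from itertools import product

SAFE, UNSAFE = "S", "U"  # hypothetical labels, not taken from the paper


def combination_types():
    """Enumerate (image, text, outcome) triples under the stated assumptions."""
    combos = []
    for image, text in product((UNSAFE, SAFE), repeat=2):
        if image == SAFE and text == SAFE:
            # Assumption: a benign-looking pair may still be unsafe in context,
            # so the safe/safe pairing contributes two distinct cases.
            combos.append((image, text, UNSAFE))
            combos.append((image, text, SAFE))
        else:
            # Assumption: any unsafe input makes the whole pair unsafe.
            combos.append((image, text, UNSAFE))
    return combos


if __name__ == "__main__":
    for image, text, outcome in combination_types():
        print(f"image={image} text={text} -> outcome={outcome}")
```

Under these assumptions the four input pairings expand to five cases, with the extra case being the contextually unsafe outcome from a seemingly benign pair that the abstract says earlier benchmarks tend to overlook.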

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. Regarding the safety benchmarks, this paper combines different images and texts together to make it contain a large variety of image-text combination types.

Weaknesses

1. As we can see from Table 1, compared with the existing benchmarks, HoliSafe does not include new types of combinations. In other words, I am confused about the motivation of this work. Why can't we just merge all the existing benchmarks together to obtain a more comprehensive dataset?
2. Due to the ill-posed motivation, I am concerned about the significance of this work for both academia and industrial communities.
3. The VGM method is incremental. By including a classifier in the model to

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The proposed dataset is comprehensive and contains various multimodal safety combinations.
2. The paper is well organized.
3. The VSG design is novel and interpretable.
4. The performance of the fine-tuned model is strong in both safety and utility tasks.

Weaknesses

1. **The Reliability of Categorizing Images by Safety Category and Safeness** - The images are labeled by human experts and GPT-4o. However, the criteria for determining image safety are somewhat ambiguous. For example, I believe the second image in Figure 1 is safe, as it only depicts an ID card and keys. In contrast, the fourth image in Figure 1 conveys a sense of violent intent, which I consider unsafe.
2. **Lack Validation of GPT-4o Generated Instruction-Response Pairs** - The paper la

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The experiments, evaluated models, and datasets are comprehensive, reflecting substantial effort and careful execution. I appreciate the thoroughness of the evaluation and the significant work that clearly went into conducting it.
- The motivation to build a comprehensive benchmark that systematically covers all possible combinations of modality fusions is strong. To the best of my knowledge and based on the paper’s claims, prior works and benchmarks have not achieved this level of completene

Weaknesses

- While the paper criticizes prior works for lacking architectural innovations to enhance safety, its own proposed Visual Guard Module (VGM) is conceptually simple; a small MLP added on top of the VLM, and thus does not constitute a substantial architectural contribution. I believe the contribution should still emphasize the benchmark and the data.
- Since the proposed approach updates not only the new module but also the vision encoder and adapter weights, unlike most standard post-training s

Code & Models

Datasets

etri-vilab/holisafe-bench
dataset · 910 dl
