HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model
Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilchae Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang

TL;DR
HoliSafe introduces a comprehensive safety benchmark and a modular visual guard to improve the safety and interpretability of vision-language models, addressing current limitations in safety evaluation and architectural robustness.
Contribution
The paper presents HoliSafe, a holistic safety dataset and benchmark, along with a modular visual guard module that enhances safety and interpretability of VLMs.
Findings
HoliSafe-Bench exposes vulnerabilities in existing VLMs.
Safe-VLM with VGM achieves state-of-the-art safety performance.
The visual guard module provides interpretable harmfulness classification.
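As a rough illustration of what "interpretable harmfulness classification" could mean in practice, the sketch below follows one reviewer's characterization of the VGM as a small MLP added on top of the VLM. It is not the authors' implementation: the mean pooling, the layer sizes, the random untrained weights, and the `CATEGORIES` label set are all hypothetical choices made for the example.

```python
import math
import random

# Hypothetical label set for illustration only; not the paper's taxonomy.
CATEGORIES = ["safe", "violence", "privacy"]


def mean_pool(tokens):
    """Average a list of equal-length token vectors into one feature vector."""
    n, d = len(tokens), len(tokens[0])
    return [sum(tok[j] for tok in tokens) / n for j in range(d)]


def linear(x, weights, bias):
    """Compute weights @ x + bias for a row-major weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class GuardHead:
    """Two-layer MLP classifier over pooled visual features (untrained sketch)."""

    def __init__(self, dim, hidden, n_classes, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [[rng.gauss(0.0, 0.1) for _ in range(hidden)] for _ in range(n_classes)]
        self.b2 = [0.0] * n_classes

    def __call__(self, tokens):
        pooled = mean_pool(tokens)
        hidden = [max(0.0, h) for h in linear(pooled, self.w1, self.b1)]  # ReLU
        return softmax(linear(hidden, self.w2, self.b2))


# Four fake visual tokens of dimension 8 stand in for vision-encoder output.
rng = random.Random(1)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
head = GuardHead(dim=8, hidden=16, n_classes=len(CATEGORIES))
probs = head(tokens)
predicted = CATEGORIES[probs.index(max(probs))]
```

The head returns a probability per category, so the predicted label can be reported alongside the model's text response. That per-category output is the sense in which such a classifier is interpretable in this sketch.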
Abstract
Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely primarily on data-centric tuning, with limited architectural innovations to intrinsically strengthen safety. We address these gaps by introducing a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations, providing a more robust basis for both training and evaluation (HoliSafe-Bench). We further propose a novel modular framework for enhancing VLM safety with a visual guard module (VGM)…
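The abstract as excerpted here does not enumerate the five combinations. One plausible reading, sketched below as an assumption, is a 2x2 grid of image and text safety in which the safe-image, safe-text cell is split by whether the pair is harmful in combination (the "contextually unsafe outcomes from seemingly benign pairs" case). The enum member names and the `label_pair` helper are hypothetical.

```python
from enum import Enum


class Combo(Enum):
    """Assumed enumeration of five image-text safety combinations."""
    UNSAFE_IMAGE_UNSAFE_TEXT = "U_img + U_txt -> unsafe"
    UNSAFE_IMAGE_SAFE_TEXT = "U_img + S_txt -> unsafe"
    SAFE_IMAGE_UNSAFE_TEXT = "S_img + U_txt -> unsafe"
    SAFE_PAIR_UNSAFE_OUTCOME = "S_img + S_txt -> unsafe"
    SAFE_PAIR_SAFE_OUTCOME = "S_img + S_txt -> safe"


def label_pair(image_safe: bool, text_safe: bool, combined_safe: bool = True) -> Combo:
    """Map per-modality safety flags to one of the five assumed combinations.

    `combined_safe` only matters when both inputs are individually safe;
    any unsafe input makes the pair unsafe regardless of its value.
    """
    if not image_safe and not text_safe:
        return Combo.UNSAFE_IMAGE_UNSAFE_TEXT
    if not image_safe:
        return Combo.UNSAFE_IMAGE_SAFE_TEXT
    if not text_safe:
        return Combo.SAFE_IMAGE_UNSAFE_TEXT
    if combined_safe:
        return Combo.SAFE_PAIR_SAFE_OUTCOME
    return Combo.SAFE_PAIR_UNSAFE_OUTCOME


def is_unsafe(combo: Combo) -> bool:
    """Only the fully benign pair counts as safe under this reading."""
    return combo is not Combo.SAFE_PAIR_SAFE_OUTCOME


# A benign image with a benign question that is harmful only in combination.
contextual = label_pair(image_safe=True, text_safe=True, combined_safe=False)
```

Under this reading, four of the five cases call for a refusal or a safety-aware answer, and only the last one calls for a normal helpful response.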
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
1. For the safety benchmark, the paper pairs different images and texts so that it covers a wide variety of image-text combination types.
1. As Table 1 shows, HoliSafe does not include new types of combinations compared with the existing benchmarks. In other words, I am confused about the motivation of this work. Why can't we just merge all the existing benchmarks together to obtain a more comprehensive dataset?
2. Due to the ill-posed motivation, I am concerned about the significance of this work for both the academic and industrial communities.
3. The VGM method is incremental. By including a classifier in the model to
1. The proposed dataset is comprehensive and contains various multimodal safety combinations.
2. The paper is well organized.
3. The VGM design is novel and interpretable.
4. The performance of the fine-tuned model is strong in both safety and utility tasks.
1. **The Reliability of Categorizing Images by Safety Category and Safeness** - The images are labeled by human experts and GPT-4o. However, the criteria for determining image safety are somewhat ambiguous. For example, I believe the second image in Figure 1 is safe, as it only depicts an ID card and keys. In contrast, the fourth image in Figure 1 conveys a sense of violent intent, which I consider unsafe.
2. **Lack of Validation of GPT-4o Generated Instruction-Response Pairs** - The paper la
- The experiments, evaluated models, and datasets are comprehensive, reflecting substantial effort and careful execution. I appreciate the thoroughness of the evaluation and the significant work that clearly went into conducting it.
- The motivation to build a comprehensive benchmark that systematically covers all possible combinations of modality fusions is strong. To the best of my knowledge and based on the paper’s claims, prior works and benchmarks have not achieved this level of completene
- While the paper criticizes prior works for lacking architectural innovations to enhance safety, its own proposed Visual Guard Module (VGM) is conceptually simple: a small MLP added on top of the VLM. It thus does not constitute a substantial architectural contribution. I believe the contribution should still emphasize the benchmark and the data.
- Since the proposed approach updates not only the new module but also the vision encoder and adapter weights, unlike most standard post-training s
Code & Models
- 🤗 etri-vilab/SafeGem-12B · model · 15 dl · ♡ 3
- 🤗 etri-vilab/SafeGem-27B · model · 13 dl · ♡ 3
- 🤗 etri-vilab/SafeQwen2.5-VL-7B · model · 362 dl · ♡ 3
- 🤗 etri-vilab/SafeQwen2.5-VL-32B · model · 218 dl · ♡ 3
- 🤗 etri-vilab/SafeLLaVA-13B · model · 20 dl · ♡ 3
- 🤗 etri-vilab/SafeLLaVA-7B · model · 21 dl · ♡ 3
