SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model
Yongting Zhang, Lu Chen, Guodong Zheng, Yifeng Gao, Rui Zheng, Jinlan Fu, Zhenfei Yin, Senjie Jin, Yu Qiao, Xuanjing Huang, Feng Zhao, Tao Gui, Jing Shao

TL;DR
This paper introduces SPA-VL, a large-scale, diverse dataset designed to improve safety alignment in Vision Language Models by providing extensive harmfulness coverage and automated data collection from multiple models.
Contribution
The paper presents SPA-VL, a comprehensive safety preference dataset for VLMs, addressing the lack of large, high-quality datasets for safety alignment in multimodal models.
Findings
Models trained on SPA-VL show improved harmlessness.
VLMs show enhanced helpfulness after training on SPA-VL.
SPA-VL covers extensive harmfulness categories.
Abstract
The emergence of Vision Language Models (VLMs) has brought unprecedented advances in understanding multimodal information. The combination of textual and visual semantics in VLMs is highly complex and diverse, making the safety alignment of these models challenging. Furthermore, due to the limited study on the safety alignment of VLMs, there is a lack of large-scale, high-quality datasets. To address these limitations, we propose a Safety Preference Alignment dataset for Vision Language Models named SPA-VL. In terms of breadth, SPA-VL covers 6 harmfulness domains, 13 categories, and 53 subcategories, and contains 100,788 samples of the quadruple (question, image, chosen response, rejected response). In terms of depth, the responses are collected from 12 open-source (e.g., QwenVL) and closed-source (e.g., Gemini) VLMs to ensure diversity. The construction of preference data is fully…
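To make the quadruple described in the abstract concrete, here is a minimal Python sketch of how one preference sample might be represented. The class name, field names, the use of a file path for the image, and the sanity check are illustrative assumptions, not the schema of the released dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferenceSample:
    """One (question, image, chosen response, rejected response) quadruple.

    All field names are hypothetical; the released dataset may use others.
    """
    question: str
    image_path: str  # assumption: the image is referenced by a file path
    chosen: str      # the preferred response
    rejected: str    # the dispreferred response


def is_well_formed(sample: PreferenceSample) -> bool:
    """Sanity check: every field is non-empty and the two responses differ."""
    fields = (sample.question, sample.image_path, sample.chosen, sample.rejected)
    return all(f.strip() for f in fields) and sample.chosen != sample.rejected


# Placeholder values only; these are not taken from SPA-VL.
example = PreferenceSample(
    question="What is shown in this picture?",
    image_path="images/placeholder.jpg",
    chosen="A placeholder preferred response.",
    rejected="A placeholder dispreferred response.",
)
print(is_well_formed(example))
```

A record of this shape is the typical input to preference-based training, where the chosen and rejected responses to the same question and image are compared.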
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
* The release of a public dataset for VLM alignment is sound and an important asset for the community, as only aligning the LLM is not enough to ensure full alignment, as the authors rightfully explain (L-40).
* The size and coverage of the dataset are large.
* Experiments are extensive to measure the impact the dataset can have on the alignment of multiple VLMs, with ablation on the dataset size and variations in the alignment methodologies (DPO, PPO, projections, etc.).
* The dataset construction process requires assessing unsafe content in the generated questions and answers, which is done with MD-Judge (L-130-132). However, this model is trained to assess natural language QA, without consideration of images. How, then, does the dataset creation specifically tackle *VLM* alignment?
* The dataset construction relies on closed-source, proprietary models (Gemini), whose API may change over time, which hinders the reproducibility of the dataset constr
1. This paper is well written and organized. The overall structure is reasonable, and the authors' argument is easy to follow.
2. The appendix is clear and detailed. This thoroughness ensures that readers can easily understand the methodologies and results discussed in the paper.
1. The experimental results only present the performance of the proposed dataset, without a broader comparison.
2. The analysis in the introduction lacks depth and evidence. For example, the authors could use a specific sample to demonstrate the alignment advantages of this dataset for visual modalities.
3. There appears to be no discussion of whether this dataset has a data breach issue, which may affect the evaluation results.
(1) It contributes an important and large-scale Safety Preference Alignment dataset for the VLM research community.
(2) Very rich experiments have been carried out to illustrate the validity of the proposed dataset.
In Table 12, most of the indicators are on a downward trend.
Taxonomy
Topics: Safety Warnings and Signage
