Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across Modalities
Yiting Qu, Michael Backes, Yang Zhang

TL;DR
This paper evaluates vision-language models' ability to recognize unsafe concepts across modalities, identifies existing gaps, and proposes a reinforcement learning method to improve their safety alignment, supported by a new dataset and systematic analysis.
Contribution
Introduces the UnsafeConcepts dataset, evaluates VLMs' perception and alignment of unsafe concepts, and proposes an RL-based approach to enhance safety recognition across modalities.
Findings
Most VLMs can perceive unsafe concepts but sometimes misclassify them as safe.
A modality gap exists in open-source VLMs between visual and textual unsafe concept recognition.
The proposed RL-based method improves VLM safety alignment more effectively than baselines.
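One way to make the modality-gap finding concrete is sketched below. This is an illustrative assumption, not the paper's metric: it treats the gap as the difference between the share of items a model flags as unsafe when a concept is given as text and the share it flags when the same concept is given as an image. The `Judgment` record, the `unsafe_rate` and `modality_gap` helpers, and the toy data are all hypothetical names and values introduced for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One model verdict on one item (hypothetical record layout)."""
    concept: str          # e.g. "Swastika"
    modality: str         # "text" or "image"
    flagged_unsafe: bool  # True if the model labeled the item unsafe


def unsafe_rate(judgments, modality):
    """Fraction of items in the given modality that the model flagged as unsafe."""
    flags = [j.flagged_unsafe for j in judgments if j.modality == modality]
    if not flags:
        raise ValueError(f"no judgments for modality {modality!r}")
    return sum(flags) / len(flags)


def modality_gap(judgments):
    """Text unsafe-rate minus image unsafe-rate (assumed definition)."""
    return unsafe_rate(judgments, "text") - unsafe_rate(judgments, "image")


# Toy data, made up for illustration only; not results from the paper.
toy = [
    Judgment("Swastika", "text", True),
    Judgment("Swastika", "image", True),
    Judgment("Sexual Harassment", "text", True),
    Judgment("Sexual Harassment", "image", False),
    Judgment("Assaults", "text", True),
    Judgment("Assaults", "image", True),
    Judgment("Assaults", "text", True),
    Judgment("Assaults", "image", False),
]

gap = modality_gap(toy)
print(f"text rate={unsafe_rate(toy, 'text'):.2f}, "
      f"image rate={unsafe_rate(toy, 'image'):.2f}, gap={gap:.2f}")
```

Under this assumed definition, a positive gap means the model flags a concept more readily when it is described in text than when it is shown in an image.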
Abstract
Vision-language models (VLMs) are increasingly applied to identify unsafe or inappropriate images due to their internal ethical standards and powerful reasoning abilities. However, it is still unclear whether they can recognize various unsafe concepts when presented in different modalities, such as text and images. To address this, we first compile the UnsafeConcepts dataset, featuring 75 unsafe concepts, e.g., "Swastika," "Sexual Harassment," and "Assaults," along with 1.5K associated images. We then conduct a systematic evaluation of VLMs' perception (concept recognition) and alignment (ethical reasoning) capabilities. We assess eight popular VLMs and find that, although most VLMs accurately perceive unsafe concepts, they sometimes mistakenly classify these concepts as safe. We also identify a consistent modality gap among open-source VLMs in distinguishing between visual and…
Taxonomy
Topics: Natural Language Processing Techniques
