Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across Modalities

Yiting Qu; Michael Backes; Yang Zhang

arXiv:2507.11155·cs.CR·July 16, 2025

TL;DR

This paper evaluates vision-language models' ability to recognize unsafe concepts across modalities, identifies existing gaps, and proposes a reinforcement learning method to improve their safety alignment, supported by a new dataset and systematic analysis.

Contribution

Introduces the UnsafeConcepts dataset, evaluates how well VLMs perceive unsafe concepts and reason about their safety (alignment), and proposes an RL-based approach to improve unsafe-concept recognition across modalities.

Findings

1. Most VLMs can perceive unsafe concepts but sometimes misclassify them as safe.
2. A modality gap exists in open-source VLMs between visual and textual unsafe concept recognition.
3. The proposed RL-based method improves VLM safety alignment more effectively than baselines.
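
One way to make the modality-gap finding concrete is the sketch below. It assumes the gap is measured as the difference between how often a model flags an unsafe concept when it is presented as text and how often it flags the same concept when it is presented as an image. This summary does not state the paper's exact metric, so the definition, the Judgment record, and both function names are illustrative assumptions rather than the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One model verdict on one unsafe concept (hypothetical record layout)."""
    concept: str
    modality: str          # "text" or "image"
    flagged_unsafe: bool   # True if the model judged the concept unsafe


def unsafe_recognition_rate(judgments, modality):
    """Fraction of items in the given modality that the model flagged as unsafe."""
    subset = [j for j in judgments if j.modality == modality]
    if not subset:
        raise ValueError(f"no judgments for modality {modality!r}")
    return sum(j.flagged_unsafe for j in subset) / len(subset)


def modality_gap(judgments):
    """Assumed definition: text recognition rate minus image recognition rate.

    A positive value means the model recognizes unsafe concepts more
    reliably in text than in images.
    """
    text_rate = unsafe_recognition_rate(judgments, "text")
    image_rate = unsafe_recognition_rate(judgments, "image")
    return text_rate - image_rate
```

Because every item is assumed to depict an unsafe concept, the recognition rate here doubles as accuracy; a dataset that mixes in safe items would need a separate accuracy computation.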

Abstract

Vision-language models (VLMs) are increasingly applied to identify unsafe or inappropriate images due to their internal ethical standards and powerful reasoning abilities. However, it is still unclear whether they can recognize various unsafe concepts when presented in different modalities, such as text and images. To address this, we first compile the UnsafeConcepts dataset, featuring 75 unsafe concepts, e.g., "Swastika," "Sexual Harassment," and "Assaults," along with 1.5K associated images. We then conduct a systematic evaluation of VLMs' perception (concept recognition) and alignment (ethical reasoning) capabilities. We assess eight popular VLMs and find that, although most VLMs accurately perceive unsafe concepts, they sometimes mistakenly classify these concepts as safe. We also identify a consistent modality gap among open-source VLMs in distinguishing between visual and…

Code & Models

Repositories

trustairlab/safervlm (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques