VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety

Shruti Palaskar; Leon Gatys; Mona Abdelrahman; Mar Jacobo; Larry Lindsey; Rutika Moharir; Gunnar Lund; Yang Xu; Navid Shiee; Jeffrey Bigham; Charles Maalouf; Joseph Yitan Cheng

arXiv:2510.18214 · cs.CV · December 4, 2025

Open Access 3 Reviews

TL;DR

This paper introduces VLSU, a comprehensive framework for evaluating the safety of multimodal models by analyzing their joint understanding of vision and language, revealing significant performance gaps and safety risks.

Contribution

The paper presents VLSU, a new benchmark and evaluation method for systematically assessing multimodal safety and joint reasoning capabilities in AI models.

Findings

01. Models perform well on unimodal safety signals but poorly on joint reasoning tasks.
02. 34% of joint safety errors occur despite correct individual modality classification.
03. Instruction framing can reduce over-blocking but may increase unsafe content acceptance.
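
To make the second finding concrete, the following is a minimal sketch of how such a share could be computed from per-sample predictions. It is not the paper's evaluation code: the `Sample` record, its field names, the `joint_error_breakdown` helper, and the toy data are all hypothetical, and the three labels are assumed from the safe / borderline / unsafe severity classes mentioned in the reviews.

```python
from dataclasses import dataclass

# Assumed label set (three-class severity scheme; names are illustrative).
SAFE, BORDERLINE, UNSAFE = "safe", "borderline", "unsafe"


@dataclass(frozen=True)
class Sample:
    """Hypothetical record: gold and predicted labels for image, text, and the pair."""
    image_gold: str
    text_gold: str
    joint_gold: str
    image_pred: str
    text_pred: str
    joint_pred: str


def joint_error_breakdown(samples):
    """Count joint-label errors and how many occur with both unimodal labels correct."""
    joint_errors = [s for s in samples if s.joint_pred != s.joint_gold]
    unimodal_ok = [
        s for s in joint_errors
        if s.image_pred == s.image_gold and s.text_pred == s.text_gold
    ]
    # Guard against division by zero when the model makes no joint errors.
    share = len(unimodal_ok) / len(joint_errors) if joint_errors else 0.0
    return {
        "joint_errors": len(joint_errors),
        "with_correct_unimodal": len(unimodal_ok),
        "share": share,
    }


# Toy data only; these are not samples from the benchmark.
toy_samples = [
    # Benign image + benign text that are unsafe together; model misses the joint risk.
    Sample(SAFE, SAFE, UNSAFE, SAFE, SAFE, SAFE),
    # Joint error that coincides with a wrong image prediction.
    Sample(UNSAFE, SAFE, UNSAFE, SAFE, SAFE, SAFE),
    # Fully correct predictions.
    Sample(SAFE, SAFE, SAFE, SAFE, SAFE, SAFE),
    Sample(BORDERLINE, SAFE, BORDERLINE, BORDERLINE, SAFE, BORDERLINE),
]

result = joint_error_breakdown(toy_samples)
print(result)
```

On the toy data, one of the two joint errors happens even though both unimodal predictions are right, which is the kind of failure the finding refers to: the error cannot be attributed to misreading either modality alone.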

Abstract

Safety evaluation of multimodal foundation models often treats vision and language inputs separately, missing risks from joint interpretation where benign content becomes harmful in combination. Existing approaches also fail to distinguish clearly unsafe content from borderline cases, leading to problematic over-blocking or under-refusal of genuinely harmful content. We present Vision Language Safety Understanding (VLSU), a comprehensive framework to systematically evaluate multimodal safety through fine-grained severity classification and combinatorial analysis across 17 distinct safety patterns. Using a multi-stage pipeline with real-world images and human annotation, we construct a large-scale benchmark of 8,187 samples spanning 15 harm categories. Our evaluation of eleven state-of-the-art models reveals systematic joint understanding failures: while models achieve 90%-plus accuracy…

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The inclusion of a borderline safety category is meaningful, and its necessity points to an appropriate direction for this field.
2. The curation of 8,187 human-annotated real-image pairs, each categorized across harm types and joint safety patterns, is a significant step towards realistic, actionable safety evaluation.

Weaknesses

1. This paper omits several important related works on multimodal safety evaluation benchmarks, a key topic of this paper. A comparison with the following papers is necessary: ELITE [1], VLGuard [2], MLLMGuard [3], JailbreakV-28k [4].
2. There is insufficient information about the human annotators. The paper states, "The image grade is labeled by one senior expert grader", which could introduce bias. Furthermore, the criteria for human annotators to judge "borderline" cases a

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The problem is well motivated and formulated. The fact that safety assignments should consider the interplay of both text and image inputs is intuitive and easy to follow.
- Introduces a comprehensive benchmark that will be a valuable contribution to the community.
- Safety taxonomy is grounded in prior work.
- I appreciate the more nuanced three-class classification over the prevalent binary safe/unsafe.
- Clear and significant findings on gaps in joint reasoning.
- Detailed error analysis in sect

Weaknesses

# Major

The major weakness of this paper is a significant lack of detail on the construction methodology, which also limits reproducibility:
- Further details on taxonomy guidelines (see below).
- **Stage 1**: What is the exact setting for the "systematic parameterization"? What prompts were used, and what were the inputs used in this parameterization?
- **Stage 2**: There is no information provided on the image corpus used for retrieval. What is its size, origin, licensing? Is there some additio

Reviewer 03 · Rating 6 · Confidence 4

Strengths

* VLSU is comprehensive, with over 8,000 samples and 17 distinct safety patterns.
* The notions of borderline cases and the triplet safety pattern are useful and could inspire follow-up work.
* The experiments systematically expose the limitations of current models in multimodal safety understanding.

Weaknesses

First, the borderline class may be subjective. The borderline class is defined as educational, informative, or discussion contexts. However, such contexts (e.g., the knowledge of making chemical weapons) can be subjective. How does VLSU ensure objectivity in this category? Including case studies of borderline cases would improve the clarity of the paper.

Second, the dataset generation pipeline is not sufficiently clear. For example:
* The image repository used in the retrieval process lacks de

Taxonomy

Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)