VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety
Shruti Palaskar, Leon Gatys, Mona Abdelrahman, Mar Jacobo, Larry Lindsey, Rutika Moharir, Gunnar Lund, Yang Xu, Navid Shiee, Jeffrey Bigham, Charles Maalouf, Joseph Yitan Cheng

TL;DR
This paper introduces VLSU, a comprehensive framework for evaluating the safety of multimodal models by analyzing their joint understanding of vision and language, revealing significant performance gaps and safety risks.
Contribution
The paper presents VLSU, a new benchmark and evaluation method for systematically assessing multimodal safety and joint reasoning capabilities in AI models.
Findings
Models perform well on unimodal safety signals but poorly on joint reasoning tasks.
34% of joint safety errors occur despite correct individual modality classification.
Instruction framing can reduce over-blocking but may increase unsafe content acceptance.
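The second finding is a conditional share: among samples where the joint image-text label is predicted wrongly, the fraction where both the image-only and text-only labels were predicted correctly. The sketch below shows how such a share could be computed. The `Sample` record, its field names, and the function name are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Sample:
    # Hypothetical record layout; field names are illustrative assumptions.
    image_gold: str
    text_gold: str
    joint_gold: str
    image_pred: str
    text_pred: str
    joint_pred: str


def share_of_joint_errors_with_correct_unimodal(samples: Iterable[Sample]) -> float:
    """Fraction of joint-label errors where both unimodal predictions were correct."""
    joint_errors = [s for s in samples if s.joint_pred != s.joint_gold]
    if not joint_errors:
        return 0.0  # no joint errors, so the share is defined as zero here
    unimodal_correct = [
        s for s in joint_errors
        if s.image_pred == s.image_gold and s.text_pred == s.text_gold
    ]
    return len(unimodal_correct) / len(joint_errors)


if __name__ == "__main__":
    toy = [
        # Joint error although image and text were each classified correctly.
        Sample("safe", "safe", "unsafe", "safe", "safe", "safe"),
        # Joint error that coincides with a text-only error.
        Sample("safe", "unsafe", "unsafe", "safe", "safe", "safe"),
        # Fully correct sample; ignored because it is not a joint error.
        Sample("safe", "safe", "safe", "safe", "safe", "safe"),
    ]
    print(share_of_joint_errors_with_correct_unimodal(toy))  # prints 0.5
```

The toy data is made up for illustration; it does not reproduce the paper's 34% figure.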
Abstract
Safety evaluation of multimodal foundation models often treats vision and language inputs separately, missing risks from joint interpretation where benign content becomes harmful in combination. Existing approaches also fail to distinguish clearly unsafe content from borderline cases, leading to problematic over-blocking or under-refusal of genuinely harmful content. We present Vision Language Safety Understanding (VLSU), a comprehensive framework to systematically evaluate multimodal safety through fine-grained severity classification and combinatorial analysis across 17 distinct safety patterns. Using a multi-stage pipeline with real-world images and human annotation, we construct a large-scale benchmark of 8,187 samples spanning 15 harm categories. Our evaluation of eleven state-of-the-art models reveals systematic joint understanding failures: while models achieve 90%-plus accuracy…
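One way to picture the combinatorial analysis is to treat each sample as a triplet of severity labels and tally how often each triplet occurs. The sketch below assumes three severity levels (safe, borderline, unsafe) and an (image, text, joint) ordering; both the ordering and the helper names are assumptions made for illustration. Three levels over three positions give 27 possible triplets. The abstract reports 17 distinct patterns, and this sketch does not attempt to say which ones those are.

```python
from collections import Counter
from itertools import product
from typing import Iterable, Tuple

# Assumed severity levels; the (image, text, joint) ordering is also an assumption.
SEVERITIES = ("safe", "borderline", "unsafe")
Triplet = Tuple[str, str, str]

# Every triplet that is possible in principle: 3 levels ** 3 positions = 27.
ALL_TRIPLETS = list(product(SEVERITIES, repeat=3))


def tally_patterns(triplets: Iterable[Triplet]) -> Counter:
    """Count how often each (image, text, joint) severity triplet occurs."""
    items = list(triplets)
    for t in items:
        if len(t) != 3 or any(level not in SEVERITIES for level in t):
            raise ValueError(f"not a valid severity triplet: {t!r}")
    return Counter(items)


if __name__ == "__main__":
    toy = [
        ("safe", "safe", "unsafe"),      # benign parts, harmful combination
        ("safe", "safe", "unsafe"),
        ("unsafe", "safe", "unsafe"),
    ]
    for pattern, count in tally_patterns(toy).most_common():
        print(pattern, count)
```

A tally like this makes it easy to see which patterns are well represented in a dataset and which are rare or absent.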
Peer Reviews
Decision: ICLR 2026 Poster
1. The inclusion of a borderline safety category is meaningful, and its necessity points to an appropriate direction for this field.
2. The curation of 8,187 human-annotated real-image pairs, each categorized across harm types and joint safety patterns, is a significant step towards realistic, actionable safety evaluation.
1. This paper omits several important related works on multimodal safety evaluation benchmarks, a key topic of this paper. A comparison with the following papers is necessary: ELITE [1], VLGuard [2], MLLMGuard [3], JailbreakV-28k [4].
2. There is insufficient information about the human annotators. The paper states, "The image grade is labeled by one senior expert grader", which could introduce bias. Furthermore, the criteria for human annotators to judge "borderline" cases a
- the problem is well motivated and formulated. The fact that safety assignments should consider the interplay of both text and image inputs is intuitive and easy to follow
- introduces a comprehensive benchmark that will be a valuable contribution to the community
- safety taxonomy is grounded in prior work
- I appreciate the more nuanced three-class classification over the prevalent binary safe/unsafe
- Clear and significant findings on gaps in joint reasoning
- detailed error analysis in sect
# Major
The major weakness of this paper is a significant lack of detail on the construction methodology, which also limits reproducibility:
- further details on taxonomy guidelines (see below)
- **Stage 1** What is the exact setting for the "systematic parameterization"? What prompts were used, and what were the inputs used in this parameterization?
- **Stage 2** There is no information provided on the image corpus used for retrieval. What is its size, origin, licensing? Is there some additio
* VLSU is comprehensive, with over 8,000 samples and 17 distinct safety patterns.
* The notions of borderline cases and the triplet safety pattern are useful and could inspire follow-up work.
* The experiments systematically expose the limitations of current models in multimodal safety understanding.
First, the borderline class may be subjective. The borderline class is defined as educational, informative, or discussion contexts. However, such contexts (e.g., the knowledge of making chemical weapons) can be subjective. How does VLSU ensure objectivity in this category? Including case studies of borderline cases would improve the clarity of the paper.
Second, the dataset generation pipeline is not sufficiently clear. For example:
* The image repository used in the retrieval process lacks de
Taxonomy
Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)
