USB: A Comprehensive and Unified Safety Evaluation Benchmark for Multimodal Large Language Models
Baolin Zheng, Guanlin Chen, Hongqiong Zhong, Qingyang Teng, Yingshui Tan, Zhendong Liu, Weixun Wang, Jiaheng Liu, Jian Yang, Huiyun Jing, Jincheng Wei, Wenbo Su, Xiaoyong Zhu, Bo Zheng, Kaifu Zhang

TL;DR
This paper introduces USB, a comprehensive safety evaluation benchmark for Multimodal Large Language Models, addressing data quality, coverage, and risk combination gaps to improve security assessment.
Contribution
The paper presents USB, a novel, extensive benchmark with high-quality data, covering multiple risk categories and modality combinations, integrating synthetic data to enhance evaluation comprehensiveness.
Findings
Existing benchmarks are insufficiently comprehensive.
USB covers 61 risk sub-categories with 4 modality combinations.
Synthetic data generation enhances evaluation coverage.
Abstract
Despite their remarkable achievements and widespread adoption, Multimodal Large Language Models (MLLMs) have revealed significant security vulnerabilities, highlighting the urgent need for robust safety evaluation benchmarks. Existing MLLM safety benchmarks, however, fall short in terms of data quality, coverage, and modal risk combinations, resulting in inflated and contradictory evaluation results, which hinders the discovery and governance of security concerns. Besides, we argue that vulnerabilities to harmful queries and oversensitivity to harmless ones should be considered simultaneously in MLLM safety evaluation, whereas these were previously considered separately. In this paper, to address these shortcomings, we introduce Unified Safety Benchmarks (USB), which is one of the most comprehensive evaluation benchmarks in MLLM safety. Our benchmark features high-quality queries,…
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
1. Coverage & structure. A clear two-axis design (taxonomy × modality combos) with explicit attention to RIST/SIST, which most benchmarks miss. 2. Difficulty & discrimination. Lower average SR across models than prior sets; USB-Hard further separates systems while correlating with USB-Base (ρ≈0.98). 3. Trade-off view. Joint reporting of vulnerability (SR) and over-refusal (RR) is practical and often neglected. Metrics are defined and computed consistently.
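The SR/RR trade-off and the USB-Base/USB-Hard correlation mentioned in this review can be illustrated with a minimal sketch. It rests on assumptions the page does not confirm: SR is read as the share of harmful queries handled safely, RR as the share of harmless queries refused, and ρ as a rank (Spearman) correlation over per-model scores. The `JudgedItem` record and all function names are hypothetical, not the authors' code.

```python
from dataclasses import dataclass


@dataclass
class JudgedItem:
    harmful: bool  # query is intended as harmful (hypothetical field)
    refused: bool  # model declined to answer
    safe: bool     # judge labelled the response as safe


def safety_rate(items):
    """Share of harmful queries whose response was judged safe (assumed SR)."""
    harmful = [it for it in items if it.harmful]
    return sum(it.safe for it in harmful) / len(harmful)


def refusal_rate(items):
    """Share of harmless queries that the model refused (assumed RR)."""
    benign = [it for it in items if not it.harmful]
    return sum(it.refused for it in benign) / len(benign)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def rank_correlation(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks.

    Assumes both inputs have the same length and are not constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Reporting both rates side by side makes the trade-off visible: a model that refuses everything would score well on the first and poorly on the second. Passing one list of per-model scores for each subset to `rank_correlation` gives a single number comparable in spirit to the ρ quoted above, under the stated assumption about which correlation was used.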
1. A large share of items is synthetically generated, which can imprint generator-specific artifacts and cause domain shift from real photos/screenshots. This clouds external validity. I recommend: report metrics by source (public vs. synthetic) and add out-of-domain tests (real photos/screens vs. synthetic). 2. The evaluation relies on a single closed-source judge, which risks judge bias and hides sensitivity to hyperparameters. I suggest: add a multi-judge setup (mix of open- and closed-source)
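The two recommendations in this review (metrics by source, and more than one judge) could be combined roughly as follows. This is a sketch under assumptions, not anything described by the paper: each record is a hypothetical `(source, verdicts)` pair, verdicts are the strings `"safe"` or `"unsafe"` from independent judges, and ties are resolved conservatively as `"unsafe"`.

```python
from collections import Counter, defaultdict


def majority_verdict(verdicts):
    """Combine judge labels by majority vote; a tie counts as 'unsafe'."""
    counts = Counter(verdicts)
    return "safe" if counts["safe"] > counts["unsafe"] else "unsafe"


def safe_rate_by_source(records):
    """Fraction of items voted safe, grouped by data source.

    records: iterable of (source, verdicts) pairs, where source is a label
    such as 'public' or 'synthetic' (hypothetical names).
    """
    totals = defaultdict(int)
    safe = defaultdict(int)
    for source, verdicts in records:
        totals[source] += 1
        if majority_verdict(verdicts) == "safe":
            safe[source] += 1
    return {source: safe[source] / totals[source] for source in totals}
```

A large gap between the per-source rates would be one signal of the domain shift the reviewer is concerned about, and comparing results under different judge subsets would give a rough sense of judge sensitivity.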
Category coverage is a strength of the paper. Users only need to test on the Unified Safety Benchmark to obtain a comprehensive and reliable safety assessment without combining multiple benchmarks.
1. The paper states that current benchmarks have limited data volume, citing fewer than 5K items as an example, yet 5K data points are already quite substantial for a benchmark. 2. It is unclear whether the synthetic prompts, and the synthetic images in particular, accurately reflect real-world scenarios and whether they are sufficient for safety evaluation. This appears to be a potential limitation of the paper. 3. The evaluated models do not appear comprehensive, and some may be outdat
1. The paper provides a thorough diagnosis of the key weaknesses in existing MLLM safety benchmarks—including insufficient data quality, limited risk coverage, and the neglect of modality combinations—and proposes targeted solutions such as automated data validation, expanded risk taxonomy, and the systematic design of four modality configurations. These are genuine and timely challenges that the work addresses in a structured and convincing manner. 2. The authors make a commendable effort to a
1. The paper primarily integrates existing benchmarks and standard practices into a larger dataset. The methodology—combining prior datasets, generating synthetic examples, and measuring refusal/safety rates—follows well-known recipes and lacks a genuinely new insight into safety measurement or model behavior. 2. The contribution reads more like a benchmark report than a scientific study. There is little theoretical motivation or analysis beyond data aggregation, and the results mostly confirm
- The paper's primary strength is its sheer scale and meticulous organization. The creation of a 61-category risk taxonomy and the systematic analysis across four modality combinations provide an unprecedentedly fine-grained tool for diagnosing MLLM safety weaknesses. This represents an engineering effort of high quality. - The authors don't just create a new dataset; they first perform a rigorous analysis of over 13 existing benchmarks to identify coverage gaps. Their data synthesis pipeline i
- The main weakness is the limited conceptual novelty. The paper follows the established paradigm of creating a static dataset of prompts and evaluating model responses. While the scale is impressive, it essentially represents a superset of existing efforts rather than a new way of thinking about evaluation. The field is arguably saturated with safety benchmarks (e.g., HarmBench, MMSafetyBench, VLSafe, VLSBench, etc.). This work feels like an escalation in a "benchmark arms race" rather than a p
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
