Omni-SafetyBench: A Benchmark for Safety Evaluation of Audio-Visual Large Language Models

Leyi Pan; Zheyu Fu; Yunpeng Zhai; Shuchang Tao; Sheng Guan; Shiyu Huang; Lingzhe Zhang; Zhaoyang Liu; Bolin Ding; Felix Henry; Aiwei Liu; Lijie Wen

arXiv:2508.07173 · cs.CL · September 30, 2025

TL;DR

Omni-SafetyBench is the first comprehensive benchmark designed to evaluate the safety of audio-visual large language models across multiple modalities, revealing critical vulnerabilities and challenges in current safety alignment methods.

Contribution

The paper introduces Omni-SafetyBench, a benchmark with tailored metrics for assessing safety and cross-modal safety consistency in omni-modal large language models (OLLMs), a setting that existing safety benchmarks do not cover.

Findings

1. Most models have safety scores below 0.6, indicating vulnerabilities.
2. Safety defenses weaken with complex audio-visual inputs.
3. Some models score as low as 0.14 on specific modalities.

Abstract

The rise of Omni-modal Large Language Models (OLLMs), which integrate visual and auditory processing with text, necessitates robust safety evaluations to mitigate harmful outputs. However, no dedicated benchmarks currently exist for OLLMs, and existing benchmarks fail to assess safety under joint audio-visual inputs or cross-modal consistency. To fill this gap, we introduce Omni-SafetyBench, the first comprehensive parallel benchmark for OLLM safety evaluation, featuring 24 modality variations with 972 samples each, including audio-visual harm cases. Considering OLLMs' comprehension challenges with complex omni-modal inputs and the need for cross-modal consistency evaluation, we propose tailored metrics: a Safety-score based on Conditional Attack Success Rate (C-ASR) and Refusal Rate (C-RR) to account for comprehension failures, and a Cross-Modal Safety Consistency score (CMSC-score) to…
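The conditional metrics named in the abstract can be illustrated with a minimal sketch. It assumes each model response has already been labeled by a judge with three booleans (understood, harmful, refused) and that C-ASR and C-RR are rates computed only over the responses marked as understood. The names `JudgedSample` and `conditional_rates` are hypothetical, not taken from the paper, and the exact Safety-score and CMSC-score formulas are not reproduced here because the abstract does not state them.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class JudgedSample:
    """One model response with judge labels (hypothetical schema)."""
    understood: bool  # judge: the model comprehended the input
    harmful: bool     # judge: the response contains harmful content
    refused: bool     # judge: the model refused to answer


def conditional_rates(
    samples: List[JudgedSample],
) -> Tuple[Optional[float], Optional[float]]:
    """Return (C-ASR, C-RR) restricted to understood samples.

    Assumption: both rates share the same denominator, the number of
    samples the judge marked as understood. Returns (None, None) when
    no sample was understood, since the rates are undefined there.
    """
    understood = [s for s in samples if s.understood]
    if not understood:
        return None, None
    n = len(understood)
    c_asr = sum(s.harmful for s in understood) / n
    c_rr = sum(s.refused for s in understood) / n
    return c_asr, c_rr


# Five responses: four understood, one not understood.
samples = [
    JudgedSample(understood=True, harmful=True, refused=False),
    JudgedSample(understood=True, harmful=False, refused=True),
    JudgedSample(understood=True, harmful=False, refused=True),
    JudgedSample(understood=True, harmful=False, refused=False),
    # Not understood: excluded, so a confused non-answer is not
    # counted as a safe outcome.
    JudgedSample(understood=False, harmful=False, refused=False),
]
c_asr, c_rr = conditional_rates(samples)
print(c_asr, c_rr)
```

Conditioning on comprehension means a model that simply fails to parse a complex audio-visual input gets no credit for avoiding harm, which is the failure mode the abstract says plain attack success rates overlook.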

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 5

Strengths

1. It is the first safety benchmark to cover audio-visual-text joint inputs, filling the modality-coverage gap highlighted in Table 1 and furnishing the community with a foundational dataset.
2. The paper introduces a comprehension-aware safety philosophy that assesses safety only after understanding, which is more realistic for human-computer interaction than simply reporting ASR, even if the implementation is flawed.
3. The experimental scale is solid: 23,328 samples across 10 models with mu

Weaknesses

1. The paper treats "not understood" as "excluded from safety evaluation," yet it relies solely on an LLM-as-judge to decide whether an input is understood. Appendix E.2 shows this judge misclassifies understanding at a rate of 11 percent, and the authors provide no robustness check such as a sufficient human validation sample, so C-ASR and C-RR are systematically biased.
2. Every audio-visual sample is generated automatically from seed text taken from MM-SafetyBench. The authors never verify c

Reviewer 02 · Rating 2 · Confidence 3

Strengths

* The paper identifies the safety of OLLMs as a critical and underexplored research challenge, providing valuable insights that could advance future work in multimodal AI safety.
* The introduction and experimental results are presented in a clear, logical, and well-structured manner, effectively guiding readers through the motivation, methodology, and findings.
* The dataset construction demonstrates a cost-effective and scalable strategy by extending existing resources to create diverse multi

Weaknesses

* The definitions and implementation details of C-ASR and C-RR are unclear, particularly regarding how the authors determine when a model “understands” an input or produces a “safe” response. This lack of transparency raises concerns about reproducibility and makes it difficult to interpret the reported safety scores with confidence.
* The motivation and theoretical justification for the CMSC-score are insufficient. It is not well explained why the standard deviation across subcategories serves

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- **Potentially Useful Benchmark**: Omni-SafetyBench fills the gap in OLLM safety evaluation by being the first benchmark to focus on audio-visual joint inputs and cross-modal safety consistency. The modality variations (unimodal: text/image/video/audio; dual-modal: image-text/video-text/audio-text; omni-modal: image-audio-text/video-audio-text) provide a comprehensive testbed for OLLMs’ multi-modal safety.
- **Comprehension-Aware and Consistency-Focused Metrics**: The proposed Safety-score addr

Weaknesses

- **Dependence on Qwen-Plus as Judge Model**: While the authors validate Qwen-Plus’s consistency with human annotators (overall accuracy >0.9), relying on a single closed-source judge model introduces potential bias, as different judge models (e.g., GPT-4o, Claude-3.5) may have varying standards for “comprehension,” “harmful content,” or “refusal,” which could affect Safety-score and CMSC-score calculations.
- **Incomplete Analysis of Modality-Specific Vulnerabilities**: While the paper identifies

Reviewer 04 · Rating 2 · Confidence 3

Strengths

- This is arguably the first benchmark to systematically evaluate OLLM safety across such a wide range of modalities (24 parallel variations), including joint audio-visual inputs. The scale of the benchmark (over 23,000 test instances) is commendable.
- The paper correctly identifies that a model's failure to understand a complex multimodal prompt can be mistaken for a successful safety refusal. The proposed Conditional metrics (C-ASR, C-RR) and the resulting Safety-score are a sensible and imp

Weaknesses

- The core weakness is the paper's limited conceptual novelty. The paradigm of creating a safety benchmark by curating or generating harmful prompts is well-established. This work extends this paradigm to more modalities. While a necessary and useful resource, it does not fundamentally change how we approach safety evaluation. The work contributes to the cat-and-mouse dynamic of benchmark creation and model patching, which has questionable long-term scientific value.
- The entire benchmark is d

Code & Models

Datasets

Leyiii/Omni-SafetyBench
dataset· 186 dl
