From Evaluation to Defense: Advancing Safety in Video Large Language Models

Yiwei Sun; Peiqi Jiang; Chuanbin Liu; Luohao Lin; Zhiying Lu; Hongtao Xie

arXiv:2505.16643·cs.CV·March 17, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a large-scale benchmark and a novel framework to evaluate and improve safety in Video Large Language Models, addressing systemic risks and demonstrating significant safety performance gains.

Contribution

The paper presents VideoSafetyEval, a comprehensive safety benchmark for Video LLMs, and proposes VideoSafety-R1, a dual-stage framework with innovative safety fine-tuning and reasoning techniques.

Findings

01. Video modality integration degrades safety by 34.2% on average.

02. VideoSafety-R1 achieves 71.1% improvement on VSE-HH.

03. Significant safety improvements on multiple datasets.

Abstract

While the safety risks of image-based large language models (Image LLMs) have been extensively studied, their video-based counterparts (Video LLMs) remain critically under-examined. To systematically study this problem, we introduce VideoSafetyEval - a large-scale, real-world benchmark for Video LLM safety, which comprises 11.4k video-query pairs and spans 19 principal risk categories. Based on this, we reveal that integrating video modality degrades safety performance by an average of 34.2%, thereby exposing systemic risks in multimodal attack exploitation. To address this vulnerability, we propose VideoSafety-R1, a dual-stage framework achieving unprecedented safety gains through three innovations: (1) the VideoSafetyThinking dataset contains 46k video-query-thinking response triplets; (2) Alarm Token-Guided Safety Fine-Tuning (AT-SFT) injects learnable alarm tokens into visual and…
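The 34.2% average figure can be read as a relative drop in a per-model safety score once video is added. The sketch below shows that arithmetic under stated assumptions: the score is taken to be a defense success rate (DSR) in percent, the change is measured relative to a text-only baseline, and the function names `relative_change` and `average_degradation` as well as the scores are hypothetical placeholders, not the paper's protocol or results.

```python
from statistics import mean


def relative_change(text_only_dsr: float, video_dsr: float) -> float:
    """Relative change in DSR (in percent) when video input is added.

    A negative value means safety degraded relative to the text-only baseline.
    """
    if text_only_dsr <= 0:
        raise ValueError("text-only DSR must be positive")
    return (video_dsr - text_only_dsr) / text_only_dsr * 100.0


def average_degradation(scores: dict[str, tuple[float, float]]) -> float:
    """Mean relative change over models.

    `scores` maps a model name to (text_only_dsr, video_dsr).
    """
    return mean(relative_change(t, v) for t, v in scores.values())


# Made-up placeholder scores for illustration only; NOT results from the paper.
toy_scores = {
    "model_a": (80.0, 40.0),
    "model_b": (50.0, 40.0),
}

if __name__ == "__main__":
    print(f"average relative change: {average_degradation(toy_scores):.1f}%")
```

Averaging per-model relative changes weights every model equally, regardless of its baseline score. Whether the paper averages this way is an assumption here.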

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper presents substantial experiments. The authors evaluate a wide range of publicly available Video LLMs in exactly the challenging setting they care about, and they do so on a balanced benchmark that separates harmful video plus harmful query (VSE-HH), safe video plus harmful query (VSE-SH), and safe queries for false refusals.

2. The paper is well written and well organized, and the presentation is clear.

3. The task in this paper focuses on the safety of Video LLMs, which is important a

Weaknesses

I would be willing to raise my rating if the authors address the following concerns:

1. A large part of the safety gains is established with a model-in-the-loop evaluator (Qwen-based). The policy that defines harmfulness is missing, so the definition of harmfulness depends entirely on the Qwen model, which may limit future research.

2. The base model used in the ablation is not specified. The gain from each component may differ across base models.

3. The paper claims t

Reviewer 02 · Rating 8 · Confidence 4

Strengths

Clear evidence that video hurts safety: The paper provides quantitative analyses showing that integrating video can sharply reduce DSR relative to text-only scenes (e.g., −79.4% for VideoLLaMA3-2B), and that harmful videos amplify adversarial effectiveness. This is an important and under-explored finding that justifies a video-specific safety agenda.

Well-designed benchmark and pipeline: VSB-77k is large, multilingual, and aligned to platform policies. The construction pipeline that consists of

Weaknesses

No coverage of “dynamic adversarial attacks”: Although the task is video, the method/evaluation effectively treats it as a set of sampled frames, rather than modeling sequence-level risk. The benchmark’s “harmful video” cases are largely explicitly harmful (e.g., direct fights, weapon displays). It does not include more realistic implicit dynamic attacks, for example, first ~10 frames benign, later ~20 frames contain fragmented harmful content or the video shows a seemingly legal tool but frame

Reviewer 03 · Rating 2 · Confidence 4

Strengths

S1: The VideoSafetyThinking dataset could be a nice contribution to the literature. If the data is of sufficient quality and interesting analyses about it can be performed, it could become a key aspect of discussion in the paper.

S2: The proposed modifications are simple and, although not particularly innovative, seem effective.

S3: In general, the idea of using reasoning data and applying RL/tokenization can be interesting

Weaknesses

W1: The claim to be the 'first large-scale, real-world benchmark for Video LLM safety' is somewhat overstated. The authors themselves cite competing (allegedly concurrent) benchmarks, in addition to the existing [1,2,3]. Similarly, the claim that the video modality introduces a vulnerability is not particularly novel: any form of fine-tuning (even on a few benign samples in the same modality, and even more so when a new modality is added) reduces the effectiveness of safety fine-tuning procedures.

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning