From Evaluation to Defense: Advancing Safety in Video Large Language Models
Yiwei Sun, Peiqi Jiang, Chuanbin Liu, Luohao Lin, Zhiying Lu, Hongtao Xie

TL;DR
This paper introduces a large-scale benchmark and a novel framework to evaluate and improve safety in Video Large Language Models, addressing systemic risks and demonstrating significant safety performance gains.
Contribution
The paper presents VideoSafetyEval, a comprehensive safety benchmark for Video LLMs, and proposes VideoSafety-R1, a dual-stage framework with innovative safety fine-tuning and reasoning techniques.
Findings
Video modality integration degrades safety by 34.2% on average.
VideoSafety-R1 achieves 71.1% improvement on VSE-HH.
Significant safety improvements on multiple datasets.
Abstract
While the safety risks of image-based large language models (Image LLMs) have been extensively studied, their video-based counterparts (Video LLMs) remain critically under-examined. To systematically study this problem, we introduce VideoSafetyEval, a large-scale, real-world benchmark for Video LLM safety, which comprises 11.4k video-query pairs and spans 19 principal risk categories. Based on this, we reveal that integrating video modality degrades safety performance by an average of 34.2%, thereby exposing systemic risks in multimodal attack exploitation. To address this vulnerability, we propose VideoSafety-R1, a dual-stage framework achieving unprecedented safety gains through three innovations: (1) the VideoSafetyThinking dataset contains 46k video-query-thinking response triplets; (2) Alarm Token-Guided Safety Fine-Tuning (AT-SFT) injects learnable alarm tokens into visual and…
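A figure such as "degrades safety performance by an average of 34.2%" is a relative change in a safety score. The sketch below shows one way to compute it, assuming the score is a defense success rate (DSR) measured once without and once with the video, and assuming the per-model relative changes are averaged. The summary does not say which aggregation the paper uses, and the function names and scores below are hypothetical, not results from the paper.

```python
def relative_change(baseline: float, with_video: float) -> float:
    """Percent change of a safety score (e.g. DSR) relative to a baseline without video."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return (with_video - baseline) / baseline * 100.0


def mean_relative_change(pairs: list[tuple[float, float]]) -> float:
    """Average the per-model relative changes over (baseline, with_video) pairs.

    Assumption: the average is taken over per-model changes, not computed
    from averaged scores; the two can differ.
    """
    changes = [relative_change(base, video) for base, video in pairs]
    return sum(changes) / len(changes)


# Hypothetical DSR values (fractions in [0, 1]) for two models; not paper data.
scores = [(0.80, 0.60), (0.90, 0.45)]
print(round(mean_relative_change(scores), 1))
```

With these made-up scores the two models drop by 25% and 50%, so the mean relative change is -37.5%. Averaging per-model changes weights every model equally regardless of its baseline, which is why the aggregation choice matters when comparing reported numbers.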
Peer Reviews
Decision: ICLR 2026 Poster
1. This paper conducts substantial experiments. The authors evaluate a wide range of publicly available video LLMs in exactly the challenging setting they care about, and they do so on a balanced benchmark that separates harmful video plus harmful query (VSE-HH), safe video plus harmful query (VSE-SH), and safe queries for false refusals. 2. This paper is well written and well organized. The presentation is clear. 3. The task in this paper focuses on the safety of video LLMs, which is important a
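The three-way split described in this review can be scored per split. The sketch below assumes each evaluated response already carries two judge labels: whether it is safe, and whether it is a refusal. It reports DSR as the fraction of safe responses on the harmful-query splits and a false-refusal rate on the safe-query split. The split codes "HH", "SH", and "SQ" and the record layout are hypothetical; the paper's exact metric definitions may differ.

```python
from collections import defaultdict


def split_metrics(records):
    """Compute per-split safety metrics.

    records: iterable of (split, is_safe, is_refusal), where split is one of the
    hypothetical codes "HH" (harmful video + harmful query), "SH" (safe video +
    harmful query), or "SQ" (safe query, used to measure false refusals).
    """
    totals = defaultdict(int)
    safe = defaultdict(int)
    refused = defaultdict(int)
    for split, is_safe, is_refusal in records:
        totals[split] += 1
        safe[split] += int(is_safe)
        refused[split] += int(is_refusal)

    metrics = {}
    # On harmful queries, a safe response counts as a successful defense.
    for split in ("HH", "SH"):
        if totals[split]:
            metrics[f"DSR_{split}"] = safe[split] / totals[split]
    # On safe queries, any refusal is a false refusal.
    if totals["SQ"]:
        metrics["false_refusal_rate"] = refused["SQ"] / totals["SQ"]
    return metrics


example = [
    ("HH", True, True),
    ("HH", False, False),
    ("SH", True, False),
    ("SQ", True, True),
    ("SQ", True, False),
]
print(split_metrics(example))
```

Keeping the false-refusal rate separate from DSR matters because a model that refuses everything would score perfectly on the harmful splits while being useless on safe queries.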
I would like to increase my rating if the authors can address the following concerns: 1. A large part of the safety gains is established with a model-in-the-loop evaluator (Qwen-based). The policy that defines harmfulness is missing, so the definition of harmfulness depends entirely on the Qwen model, which may limit future research. 2. The base model used in the ablation is not specified. The gain from each component may differ across base models. 3. The paper claims t
Clear evidence that video hurts safety: The paper provides quantitative analyses showing that integrating video can sharply reduce DSR relative to text-only scenes (e.g., −79.4% for VideoLLaMA3-2B), and that harmful videos amplify adversarial effectiveness. This is an important and under-explored finding that justifies a video-specific safety agenda. Well-designed benchmark and pipeline: VSB-77k is large, multilingual, and aligned to platform policies. The construction pipeline that consists of
No coverage of “dynamic adversarial attacks”: Although the task is video, the method and evaluation effectively treat it as a set of sampled frames rather than modeling sequence-level risk. The benchmark’s “harmful video” cases are largely explicitly harmful (e.g., direct fights, weapon displays). It does not include more realistic implicit dynamic attacks, for example, a video whose first ~10 frames are benign while the later ~20 frames contain fragmented harmful content, or one that shows a seemingly legal tool but frame
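The attack layout in this review (about 10 benign frames followed by about 20 frames that each carry only a fragment of harmful content) can be paired with a frame sampler to show what a frame-based pipeline actually receives. The sketch assumes uniform, segment-centered sampling of 8 frames; this is a common choice but an assumption here, not necessarily what the evaluated models or the paper use, and `uniform_frame_indices` is a hypothetical helper.

```python
def uniform_frame_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices at the centers of equal-length segments."""
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


# Layout described in the review: ~10 benign frames, then ~20 frames that each
# contain only a fragment of the harmful content.
labels = ["benign"] * 10 + ["fragment"] * 20

picked = uniform_frame_indices(len(labels), 8)
print(picked)                         # [1, 5, 9, 13, 16, 20, 24, 28]
print([labels[i] for i in picked])
```

Five of the eight sampled frames fall in the fragmented segment, yet each one may look harmless in isolation. Detecting such an attack therefore depends on reasoning across the sampled frames in temporal order, which is the sequence-level risk the reviewer says the method and benchmark do not model.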
S1: The VideoSafetyThinking dataset could be a nice contribution to the literature. If the data is of sufficient quality and interesting analyses about it can be performed, it could become a key aspect of discussion in the paper. S2: The proposed modifications are simple and, although not particularly innovative, seem effective. S3: In general, the idea of using reasoning data and applying RL/tokenization can be interesting
W1: The claim to be the 'first large-scale, real-world benchmark for Video LLM safety' is a bit of an overstatement. The authors themselves cite competing (allegedly concurrent) benchmarks, and [1,2,3] also exist. Similarly, the claim that the video modality introduces a vulnerability is not particularly novel: any form of fine-tuning (even on a few benign samples in the same modality, and worse still when a new modality is added) reduces the effectiveness of safety fine-tuning procedures.
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
