SurveillanceVQA-589K: A Benchmark for Comprehensive Surveillance Video-Language Understanding with Large Models

Bo Liu; Pengfei Qiao; Minhan Ma; Xuange Zhang; Yinan Tang; Peng Xu; Kun Liu; Tongtong Yuan

arXiv:2505.12589 · cs.CV · May 20, 2025

TL;DR

SurveillanceVQA-589K is a large, comprehensive benchmark dataset designed to evaluate vision-language models on complex surveillance video understanding tasks, highlighting current model limitations in real-world scenarios.

Contribution

The paper introduces the largest surveillance-specific video question answering dataset with a novel hybrid annotation pipeline and a multi-dimensional evaluation protocol.

Findings

01. Current LVLMs show significant performance gaps in causal and anomaly tasks.

02. The benchmark reveals limitations of existing models in real-world surveillance understanding.

03. SurveillanceVQA-589K enables targeted improvements in safety-critical video-language applications.

Abstract

Understanding surveillance video content remains a critical yet underexplored challenge in vision-language research, particularly due to its real-world complexity, irregular event dynamics, and safety-critical implications. In this work, we introduce SurveillanceVQA-589K, the largest open-ended video question answering benchmark tailored to the surveillance domain. The dataset comprises 589,380 QA pairs spanning 12 cognitively diverse question types, including temporal reasoning, causal inference, spatial understanding, and anomaly interpretation, across both normal and abnormal video scenarios. To construct the benchmark at scale, we design a hybrid annotation pipeline that combines temporally aligned human-written captions with Large Vision-Language Model-assisted QA generation using prompt-based techniques. We also propose a multi-dimensional evaluation protocol to assess contextual,…

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

A key strength of this paper is its attempt to extend video question answering into the practically important domain of surveillance. The proposed large-scale dataset and the use of open-ended QA with a hybrid generation strategy reflect a degree of originality in task formulation. The paper is well structured, and the methodology and evaluation framework are clearly described. The experimental section includes multiple mainstream multimodal models, offering a broad view of current model perform

Weaknesses

This paper has several notable weaknesses. First, the originality and incremental contribution are limited. The main differences from existing surveillance QA datasets, such as UCA, lie primarily in scaling up the dataset and changing the evaluation protocol, rather than introducing any fundamental advances in task design or domain modeling. Second, the open-ended QA format places the task awkwardly between captioning and traditional VQA, failing to provide either the strict semantic alignme

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. It constructs a large-scale dataset for comprehensive cross-modal surveillance video understanding. The dataset could be of significant value for social security.
2. It provides well-classified questions, which would benefit model analysis.
3. It provides results of SOTA LVLMs, analyzes their success and failure points, and gives suggestions for future research.

Weaknesses

1. The dataset seems to be biased toward the LLaVA and Qwen series of models, since LLaVA-Video and Qwen-Max are employed for caption and QA generation, respectively. The results in Tables 3 and 4 show that LLaVA-Video outperforms other open-source models, which further confirms my concern.
2. The QAs are directly generated by Qwen-Max, without further human checking for answer correctness. This raises concern about QA quality.
3. Some basic data statistics should be moved to the main text, for example, what

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- Large scale of this dataset compared to prior work.
- Evaluation covers multiple open-source and proprietary models.

Weaknesses

- Simple baselines such as text-only, single-image, and human evaluation are not presented; they would help in assessing the benchmark.
- The fact that fine-tuning on the data does not help much is worrying and could suggest an issue in the generated data (e.g., a lack of diversity in the generated QAs).

Reviewer 04 · Rating 4 · Confidence 2

Strengths

1. The paper is clearly written and well-organized.
2. The proposed SurveillanceVQA-589K benchmark seems to extend beyond simple descriptive tasks to include higher-level reasoning such as logical reasoning, causal inference, and complex semantic comprehension of video content.
3. The authors perform a comprehensive experimental evaluation using a diverse set of LVLMs, providing valuable insights into their performance and limitations in surveillance-related scenarios.

Weaknesses

1. Although the benchmark includes diverse question types such as causal, temporal, and spatial reasoning, these dimensions have already been explored in prior video understanding benchmarks (e.g., VideoMME [1], MVBench [2]). The main novelty thus lies primarily in the surveillance domain rather than in the reasoning taxonomy itself, which somewhat limits the originality of the contribution. To enhance its distinctiveness, the benchmark could introduce evaluation dimensions specifically tailored

Taxonomy

Topics: Multimodal Machine Learning Applications · Anomaly Detection Techniques and Applications · Domain Adaptation and Few-Shot Learning