SurveillanceVQA-589K: A Benchmark for Comprehensive Surveillance Video-Language Understanding with Large Models
Bo Liu, Pengfei Qiao, Minhan Ma, Xuange Zhang, Yinan Tang, Peng Xu, Kun Liu, Tongtong Yuan

TL;DR
SurveillanceVQA-589K is a large, comprehensive benchmark dataset designed to evaluate vision-language models on complex surveillance video understanding tasks, highlighting current model limitations in real-world scenarios.
Contribution
The paper introduces the largest surveillance-specific video question answering dataset with a novel hybrid annotation pipeline and a multi-dimensional evaluation protocol.
Findings
Current LVLMs show significant performance gaps in causal and anomaly tasks.
The benchmark reveals limitations of existing models in real-world surveillance understanding.
SurveillanceVQA-589K enables targeted improvements in safety-critical video-language applications.
Abstract
Understanding surveillance video content remains a critical yet underexplored challenge in vision-language research, particularly due to its real-world complexity, irregular event dynamics, and safety-critical implications. In this work, we introduce SurveillanceVQA-589K, the largest open-ended video question answering benchmark tailored to the surveillance domain. The dataset comprises 589,380 QA pairs spanning 12 cognitively diverse question types, including temporal reasoning, causal inference, spatial understanding, and anomaly interpretation, across both normal and abnormal video scenarios. To construct the benchmark at scale, we design a hybrid annotation pipeline that combines temporally aligned human-written captions with Large Vision-Language Model-assisted QA generation using prompt-based techniques. We also propose a multi-dimensional evaluation protocol to assess contextual,…
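As a rough illustration of the caption-to-QA step described in the abstract, the sketch below assembles a text prompt from timestamped captions for one question type. It is not the authors' pipeline: the `Caption` schema, the `build_qa_prompt` helper, and the prompt wording are all hypothetical, and the call to an actual model is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Caption:
    """A human-written caption aligned to a time span in seconds (hypothetical schema)."""
    start: float
    end: float
    text: str


def build_qa_prompt(captions, question_type, num_pairs=3):
    """Assemble a text prompt asking a model for QA pairs of one question type.

    The prompt wording is an assumption for illustration only.
    """
    # Order captions chronologically so events appear in sequence.
    ordered = sorted(captions, key=lambda c: c.start)
    lines = [f"[{c.start:.1f}s-{c.end:.1f}s] {c.text}" for c in ordered]
    return (
        "The following timestamped captions describe a surveillance video.\n"
        + "\n".join(lines)
        + f"\n\nWrite {num_pairs} open-ended question-answer pairs of type "
        f"'{question_type}'. Use only information stated in the captions."
    )


captions = [
    Caption(5.0, 9.5, "A man runs away."),
    Caption(0.0, 4.0, "A man enters the store."),
]
prompt = build_qa_prompt(captions, "causal inference", num_pairs=2)
print(prompt)
```

The resulting string would then be sent to whichever language model generates the QA pairs; restricting the model to the caption text is one simple way to keep generated answers tied to human-written annotations.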
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
A key strength of this paper is its attempt to extend video question answering into the practically important domain of surveillance. The proposed large-scale dataset and the use of open-ended QA with a hybrid generation strategy reflect a degree of originality in task formulation. The paper is well structured, and the methodology and evaluation framework are clearly described. The experimental section includes multiple mainstream multimodal models, offering a broad view of current model perform
This paper has several notable weaknesses. First, the originality and incremental contribution are limited. The main differences from existing surveillance QA datasets, such as UCA, lie primarily in scaling up the dataset and changing the evaluation protocol, rather than introducing any fundamental advances in task design or domain modeling. Second, the open-ended QA format places the task awkwardly between captioning and traditional VQA, failing to provide either the strict semantic alignme
1. It constructs a large-scale dataset for comprehensive cross-modal surveillance video understanding. The dataset could be of significant value for social security.
2. It provides well-classified questions, which would benefit model analysis.
3. It provides results for SOTA LVLMs, analyzes their success and failure points, and gives suggestions for future research.
1. The dataset seems biased toward the LLaVA and Qwen series of models, since LLaVA-Video and Qwen-Max are employed for caption and QA generation, respectively. The results in Tables 3 and 4 show that LLaVA-Video outperforms other open-source models, which further confirms my concern.
2. The QAs are directly generated by Qwen-Max, without further human checking for answer correctness. This raises concerns about QA quality.
3. Some basic data statistics should be moved to the main text, for example, what
- Large scale of this dataset compared to prior work
- Evaluation including both multiple open source and proprietary models
- Simple baselines like text-only, single-image, and human evaluation are not presented and would help assess the benchmark.
- The fact that fine-tuning on the data does not help much is worrying and could suggest an issue in the generated data (e.g., lack of diversity in the generated QAs).
1. The paper is clearly written and well-organized.
2. The proposed SurveillanceVQA-589K benchmark seems to extend beyond simple descriptive tasks to include higher-level reasoning such as logical reasoning, causal inference, and complex semantic comprehension of video content.
3. The authors perform a comprehensive experimental evaluation using a diverse set of LVLMs, providing valuable insights into their performance and limitations in surveillance-related scenarios.
1. Although the benchmark includes diverse question types such as causal, temporal, and spatial reasoning, these dimensions have already been explored in prior video understanding benchmarks (e.g., VideoMME [1], MVBench [2]). The main novelty thus lies primarily in the surveillance domain rather than in the reasoning taxonomy itself, which somewhat limits the originality of the contribution. To enhance its distinctiveness, the benchmark could introduce evaluation dimensions specifically tailored
Taxonomy
Topics: Multimodal Machine Learning Applications · Anomaly Detection Techniques and Applications · Domain Adaptation and Few-Shot Learning
