A Benchmark for Crime Surveillance Video Analysis with Large Models
Haoran Chen, Dong Yi, Moyan Cao, Chensen Huang, Guibo Zhu, Jinqiao Wang

TL;DR
This paper introduces UCVL, a new benchmark dataset for crime surveillance video analysis with multimodal large language models (MLLMs), and evaluates those models through detailed assessments and fine-tuning.
Contribution
It provides a comprehensive benchmark with diverse QA pairs and assessment methods for MLLMs in crime video analysis, filling a gap in current evaluation standards.
Findings
MLLMs show varying performance on the benchmark
Fine-tuning improves model accuracy in anomaly detection
The benchmark is reliable for evaluating large models' capabilities
Abstract
Anomaly analysis in surveillance videos is a crucial topic in computer vision. In recent years, multimodal large language models (MLLMs) have outperformed task-specific models in various domains. Although MLLMs are particularly versatile, their ability to understand anomalous concepts and details remains insufficiently studied, because the outdated benchmarks in this field provide neither MLLM-style QAs nor efficient algorithms for assessing a model's open-ended text responses. To fill this gap, we propose UCVL, a benchmark for crime surveillance video analysis with large models, comprising 1,829 videos and reorganized annotations from the UCF-Crime and UCF-Crime Annotation datasets. We design six types of questions and generate diverse QA pairs. We then develop detailed instructions and use OpenAI's GPT-4o for accurate assessment. We benchmark eight prevailing MLLMs ranging from…
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Digital Media Forensic Detection · Generative Adversarial Networks and Image Synthesis
