Evaluation of Vision-LLMs in Surveillance Video
Pascal Benschop, Cristian Meo, Justin Dauwels, Jelte P. Mense

TL;DR
This paper evaluates the ability of vision-language models to detect anomalies in surveillance videos using zero-shot, language-grounded methods, highlighting current strengths and limitations in spatial reasoning and privacy-preserving scenarios.
Contribution
It introduces a zero-shot, language-grounded approach for anomaly detection in videos using vision-language models and evaluates their performance on public datasets.
Findings
Models perform well on simple, spatially salient events.
Performance drops with noisy spatial cues and identity obfuscation.
Few-shot learning can improve accuracy but may increase false positives.
Abstract
The widespread use of cameras in our society has created an overwhelming amount of video data, far exceeding the capacity for human monitoring. This presents a critical challenge for public safety and security, as the timely detection of anomalous or criminal events is crucial for effective response and prevention. The ability of an embodied agent to recognize unexpected events is fundamentally tied to its capacity for spatial reasoning. This paper investigates the spatial reasoning of vision-language models (VLMs) by framing anomalous action recognition as a zero-shot, language-grounded task, addressing the embodied perception challenge of interpreting dynamic 3D scenes from sparse 2D video. Specifically, we investigate whether small, pre-trained vision-LLMs can act as spatially-grounded, zero-shot anomaly detectors by converting video into text descriptions and scoring labels via…
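The describe-then-score idea in the abstract can be illustrated with a toy sketch. The abstract is cut off before it names the scoring method, so the bag-of-words cosine similarity below is a stand-in chosen only for illustration, not the authors' technique. The description string, the label prompts, and the names `bag_of_words`, `cosine`, and `score_labels` are all hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a toy stand-in for a learned text embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_labels(description, label_prompts):
    """Score every candidate label against one clip description.

    No label-specific training is involved, which is what makes
    the setup zero-shot: labels are defined purely by their text prompts.
    """
    desc = bag_of_words(description)
    return {
        label: cosine(desc, bag_of_words(prompt))
        for label, prompt in label_prompts.items()
    }


# Hypothetical label prompts (assumption: not taken from the paper).
LABEL_PROMPTS = {
    "normal": "people walking calmly along a street",
    "fighting": "two people punching and fighting each other",
    "theft": "a person stealing a bag and running away",
}

# Hypothetical text description, as a vision-language model might produce for a clip.
description = "two people start punching each other near a parked car"

scores = score_labels(description, LABEL_PROMPTS)
predicted = max(scores, key=scores.get)
print(predicted, scores)
```

In a realistic pipeline the word-count vectors would be replaced by a stronger text-scoring model; the structure (one description per clip, one prompt per label, highest score wins) is the part this sketch is meant to convey.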
