Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild

Shanle Yao; Armin Danesh Pazho; Narges Rashvand; Hamed Tabkhi

arXiv:2603.04727·cs.CV·May 19, 2026

TL;DR

This paper evaluates multimodal large language models for real-world video anomaly detection, revealing their conservative bias and limited recall in zero-shot settings, and explores prompting strategies to improve performance.

Contribution

It systematically assesses MLLMs on the ShanghaiTech and CHAD VAD benchmarks, documents their bias toward the 'normal' class in zero-shot settings, and shows that class-specific prompts improve F1-score.

Findings

01. High confidence but low recall in zero-shot MLLMs for VAD

02. Class-specific instructions significantly improve F1-score

03. Performance gap identified in noisy, real-world environments

Abstract

Multimodal large language models (MLLMs) have demonstrated impressive general competence in video understanding, yet their reliability for real-world Video Anomaly Detection (VAD) remains largely unexplored. Unlike conventional pipelines relying on reconstruction or pose-based cues, MLLMs enable a paradigm shift: treating anomaly detection as a language-guided reasoning task. In this work, we systematically evaluate state-of-the-art MLLMs on the ShanghaiTech and CHAD benchmarks by reformulating VAD as a binary classification task under weak temporal supervision. We investigate how prompt specificity and temporal window lengths (1s-3s) influence performance, focusing on the precision-recall trade-off. Our findings reveal a pronounced conservative bias in zero-shot settings; while models exhibit high confidence, they disproportionately favor the 'normal' class, resulting in high…
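To make the precision-recall trade-off concrete, here is a minimal Python sketch of window-level binary evaluation. It is not the authors' pipeline: the helper names (`window_labels`, `precision_recall_f1`), the rule that a window is anomalous if any of its frames is, the toy frame rate, and the toy predictions are all assumptions made for illustration.

```python
from typing import List, Tuple


def window_labels(frame_labels: List[int], fps: int, window_s: int) -> List[int]:
    """Collapse per-frame 0/1 labels into one label per fixed-length window.

    Assumption (not from the paper): a window counts as anomalous
    if any frame inside it is labeled anomalous.
    """
    size = fps * window_s
    return [
        int(any(frame_labels[i:i + size]))
        for i in range(0, len(frame_labels), size)
    ]


def precision_recall_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the positive (anomalous) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy clip: 20 frames at a hypothetical 2 fps, anomalous in the middle.
frame_labels = [0] * 4 + [1] * 12 + [0] * 4
y_true = window_labels(frame_labels, fps=2, window_s=2)

# Hypothetical conservative predictions: the model flags only one window.
y_pred = [0, 1, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(y_true, precision, recall, f1)
```

In this toy case the single flagged window is correct, so precision is perfect, but two of the three anomalous windows are missed. That pattern, few false alarms and many missed anomalies, is what a bias toward the 'normal' class looks like in these metrics.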

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning