Is Video Anomaly Detection Misframed? Evidence from LLM-Based and Multi-Scene Models
Furkan Mumcu, Michael J. Jones, Anoop Cherian, Yasin Yilmaz

TL;DR
This paper critiques current video anomaly detection methods for overgeneralizing across scenes and emphasizes the need for scene-specific, spatially-aware, and explainable models to better capture normal behavior.
Contribution
The paper highlights the limitations of existing multi-scene, weakly supervised approaches and advocates a renewed focus on single-scene, spatially-aware anomaly detection models.
Findings
Current models often respond to familiar semantic anomaly categories rather than to deviations from the normal patterns of a particular scene.
Multi-scene models suppress spatial localization and introduce semantic bias.
Single-scene, explainable models better capture local normality patterns.
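To make the idea of scene-specific, spatially-aware normality concrete, here is a minimal sketch. It is not the authors' method. The class name `SceneNormalityModel`, the grid-cell indexing, and the toy 2-D feature vectors are all assumptions made for illustration. It stores normal exemplars per spatial cell of a single fixed-camera scene and scores a new observation by its distance to the nearest exemplar seen at the same location.

```python
import numpy as np


class SceneNormalityModel:
    """Hypothetical per-location exemplar store for one fixed-camera scene."""

    def __init__(self):
        # Maps a grid cell (row, col) to the normal feature vectors seen there.
        self.exemplars = {}

    def fit(self, cells, features):
        """Record normal feature vectors, grouped by the cell they occurred in."""
        for cell, feat in zip(cells, features):
            self.exemplars.setdefault(tuple(cell), []).append(
                np.asarray(feat, dtype=float)
            )

    def score(self, cell, feature):
        """Anomaly score: distance to the nearest normal exemplar in the same cell."""
        bank = self.exemplars.get(tuple(cell))
        if not bank:
            # Nothing was ever observed here during normal training footage.
            return float("inf")
        diffs = np.stack(bank) - np.asarray(feature, dtype=float)
        return float(np.linalg.norm(diffs, axis=1).min())


# Toy data (assumed): cell (0, 0) is a road, cell (1, 1) is a sidewalk.
# The feature [1, 0] stands in for "vehicle-like" and [0, 1] for "pedestrian-like".
model = SceneNormalityModel()
model.fit(
    cells=[(0, 0), (0, 0), (1, 1)],
    features=[[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
)

road_score = model.score((0, 0), [1.0, 0.0])      # vehicle on the road
sidewalk_score = model.score((1, 1), [1.0, 0.0])  # vehicle on the sidewalk
```

The same feature scores low on the road and high on the sidewalk, so the score reflects deviation from what is normal at that location in that scene rather than membership in a fixed anomaly category. The cell with the high score also says where the anomaly is, which is the kind of spatial localization that video-level supervision does not provide.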
Abstract
Recent video anomaly detection research has expanded rapidly with an emphasis on general models of normality intended to work across many different scenes. While this focus has led to improvements in scalability and multi-scene generalization, it has also shifted the field away from modeling the scene-specific and context-dependent nature of normal behavior. Contemporary approaches frequently rely on video-level weak supervision and opaque pretrained representations from multi-modal large language models (MLLMs), which encourage models to respond to familiar semantic anomaly categories rather than to deviations from the normal patterns of a particular environment. This trend suppresses spatial localization, introduces semantic bias, and reduces anomaly detection to a form of action recognition. In this paper, we examine whether these prevailing formulations align with the core…
