M2S2L: Mamba-based Multi-Scale Spatial-temporal Learning for Video Anomaly Detection
Yang Liu, Boan Chen, Xiaoguang Zhu, Jing Liu, Peng Sun, Wei Zhou

TL;DR
This paper introduces M2S2L, a hierarchical multi-scale spatial-temporal learning framework for video anomaly detection that balances high accuracy with computational efficiency, making it suitable for real-time surveillance.
Contribution
It proposes a novel Mamba-based multi-scale spatial-temporal model with feature decomposition for improved behavioral modeling in video anomaly detection.
Findings
Achieves high frame-level AUCs on benchmark datasets.
Maintains real-time inference speed of 45 FPS.
Operates efficiently with 20.1G FLOPs.
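To make the reported metrics concrete, the following minimal sketch shows how a frame-level AUC is commonly computed and what a 45 FPS throughput implies as a per-frame time budget. It is not the authors' evaluation code: the function names `frame_level_auc` and `per_frame_budget_ms` and the toy scores and labels are hypothetical, and the rank-based (pairwise) AUC formulation is assumed here as the standard definition.

```python
def frame_level_auc(scores, labels):
    """Pairwise (rank-based) AUC over frames.

    Returns the fraction of (anomalous, normal) frame pairs in which the
    anomalous frame receives the higher anomaly score; ties count as half.
    labels: 1 = anomalous frame, 0 = normal frame.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one normal and one anomalous frame")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def per_frame_budget_ms(fps):
    """Average time available per frame, in milliseconds, at a given FPS."""
    return 1000.0 / fps


# Toy per-frame anomaly scores and ground-truth labels (hypothetical data).
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]

auc = frame_level_auc(scores, labels)
budget = per_frame_budget_ms(45)  # 45 FPS is the speed reported above

print(f"frame-level AUC: {auc:.2f}")
print(f"per-frame budget at 45 FPS: {budget:.1f} ms")
```

At 45 FPS the model has roughly 22 ms per frame on average, which is the sense in which the reported speed is real-time for typical surveillance streams.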
Abstract
Video anomaly detection (VAD) is an essential task in the image processing community with prospects in video surveillance, but it faces a fundamental challenge in balancing detection accuracy with computational efficiency. As video content becomes increasingly complex, with diverse behavioral patterns and contextual scenarios, traditional VAD approaches struggle to provide robust assessment for modern surveillance systems. Existing methods either lack comprehensive spatial-temporal modeling or require computational resources that are excessive for real-time applications. To address this, we present a Mamba-based multi-scale spatial-temporal learning (M2S2L) framework. The proposed method employs hierarchical spatial encoders operating at multiple granularities and multi-temporal encoders capturing motion dynamics across different time scales. We also introduce a feature decomposition…
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Human Pose and Action Recognition · Video Surveillance and Tracking Methods
