Patch Spatio-Temporal Relation Prediction for Video Anomaly Detection
Hao Shen, Lu Shi, Wanru Xu, Yigang Cen, Linna Zhang, Gaoyun An

TL;DR
This paper introduces a self-supervised vision transformer approach for video anomaly detection that models spatial and temporal relationships between patches and outperforms pixel-generation-based methods on three benchmarks.
Contribution
A novel two-branch transformer network that decouples inter-patch similarity and order prediction for enhanced video anomaly detection.
Findings
Outperforms pixel-generation-based methods on three benchmarks.
Surpasses other self-supervised learning approaches.
Effectively models spatial and temporal coherence in videos.
Abstract
Video Anomaly Detection (VAD), aiming to identify abnormalities within a specific context and timeframe, is crucial for intelligent Video Surveillance Systems. While recent deep learning-based VAD models have shown promising results by generating high-resolution frames, they often lack competence in preserving detailed spatial and temporal coherence in video frames. To tackle this issue, we propose a self-supervised learning approach for VAD through an inter-patch relationship prediction task. Specifically, we introduce a two-branch vision transformer network designed to capture deep visual features of video frames, addressing spatial and temporal dimensions responsible for modeling appearance and motion patterns, respectively. The inter-patch relationship in each dimension is decoupled into inter-patch similarity and the order information of each patch. To mitigate memory consumption,…
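The two quantities the abstract names, inter-patch similarity and per-patch order information, can be illustrated with a small sketch. This is not the authors' implementation: the helper names (`split_into_patches`, `similarity_matrix`, `shuffle_with_order_labels`), the use of cosine similarity on raw pixel values, and the toy 4x4 frame are all assumptions made for illustration. The paper works with transformer features rather than raw pixels, and a temporal analogue would presumably apply the same idea to patches sampled along the time axis.

```python
import math
import random


def split_into_patches(frame, patch):
    """Split a 2D grid (list of rows) into flattened, non-overlapping patch vectors."""
    height, width = len(frame), len(frame[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            patches.append([frame[top + dy][left + dx]
                            for dy in range(patch)
                            for dx in range(patch)])
    return patches


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_matrix(patches):
    """Pairwise similarity between all patches (an assumed stand-in for the
    inter-patch similarity target)."""
    return [[cosine_similarity(p, q) for q in patches] for p in patches]


def shuffle_with_order_labels(patches, rng):
    """Shuffle patches and return them with their original indices, which
    serve as labels for an order-prediction pretext task."""
    order = list(range(len(patches)))
    rng.shuffle(order)
    shuffled = [patches[i] for i in order]
    return shuffled, order


# Toy 4x4 "frame" with values 1..16, split into four 2x2 patches.
frame = [[float(r * 4 + c + 1) for c in range(4)] for r in range(4)]
patches = split_into_patches(frame, patch=2)
sim = similarity_matrix(patches)
shuffled, order = shuffle_with_order_labels(patches, random.Random(0))
```

In a self-supervised setup of this kind, a model sees the shuffled patches and is trained to recover `order` and to predict `sim`; at test time, large prediction errors on either target would flag a frame as potentially anomalous. How the two-branch network actually scores anomalies is not stated in the truncated abstract, so that last step is an assumption.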
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Network Security and Intrusion Detection · Artificial Immune Systems Applications
Methods: Attention Is All You Need · Linear Layer · Residual Connection · Softmax · Multi-Head Attention · Dense Connections · Layer Normalization · Vision Transformer
