TL;DR
This paper demonstrates that simple encoder-only Video Vision Transformers, when properly pre-trained, can effectively and efficiently perform traffic anomaly detection, challenging the need for complex architectures.
Contribution
It shows that advanced pre-training enables simple encoder-only models to match or even surpass complex, specialized methods in traffic anomaly detection while being significantly more efficient, which underscores the importance of the pre-training strategy.
Findings
Pre-trained simple models match or surpass complex state-of-the-art methods.
Self-supervised Masked Video Modeling is the most effective pre-training approach for Traffic Anomaly Detection (TAD).
Domain-Adaptive Pre-Training improves downstream performance without labeled anomalies.
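To make the Masked Video Modeling finding more concrete, the following is a minimal sketch of the masking step only: most video tokens are hidden, and a model is trained to reconstruct them from the few that stay visible. It is not the paper's implementation. The tube-style masking (same spatial positions hidden in every temporal slice), the 0.9 mask ratio, the token-grid sizes, and the function names `tube_mask` and `visible_indices` are all assumptions chosen for illustration.

```python
import random


def tube_mask(num_temporal_slices, tokens_per_slice, mask_ratio, seed=None):
    """Return a [T][N] grid of booleans; True marks a masked (hidden) token.

    Illustrative assumption: "tube" masking, where the same spatial
    positions are masked in every temporal slice of the clip.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    num_masked = int(round(mask_ratio * tokens_per_slice))
    # Pick the spatial positions to hide once, then reuse them for all slices.
    masked_positions = set(rng.sample(range(tokens_per_slice), num_masked))
    slice_mask = [pos in masked_positions for pos in range(tokens_per_slice)]
    return [list(slice_mask) for _ in range(num_temporal_slices)]


def visible_indices(mask):
    """Flatten the grid and return indices of tokens the encoder would see."""
    tokens_per_slice = len(mask[0])
    return [
        t * tokens_per_slice + p
        for t, row in enumerate(mask)
        for p, is_masked in enumerate(row)
        if not is_masked
    ]


# Hypothetical clip: 8 temporal slices of 14 x 14 = 196 spatial tokens each.
mask = tube_mask(num_temporal_slices=8, tokens_per_slice=196, mask_ratio=0.9, seed=0)
visible = visible_indices(mask)
print(len(visible), "of", 8 * 196, "tokens stay visible")
```

Under these assumed sizes, only 160 of 1568 tokens reach the encoder during pre-training. Because the reconstruction target is the video itself, this kind of objective needs no labels, which is consistent with the finding that pre-training can be adapted to the driving domain without labeled anomalies.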
Abstract
Recent methods for ego-centric Traffic Anomaly Detection (TAD) often rely on complex multi-stage or multi-representation fusion architectures, yet it remains unclear whether such complexity is necessary. Recent findings in visual perception suggest that foundation models, enabled by advanced pre-training, allow simple yet flexible architectures to outperform specialized designs. Therefore, in this work, we investigate an architecturally simple encoder-only approach using plain Video Vision Transformers (Video ViTs) and study how pre-training enables strong TAD performance. We find that: (i) advanced pre-training enables simple encoder-only models to match or even surpass the performance of specialized state-of-the-art TAD methods, while also being significantly more efficient; (ii) although weakly- and fully-supervised pre-training are advantageous on standard benchmarks, we find them…
