Dual Branch VideoMamba with Gated Class Token Fusion for Violence Detection
Damith Chamalke Senadeera, Xiaoyun Yang, Shibo Li, Muhammad Awais, Dimitrios Kollias, Gregory Slabaugh

TL;DR
This paper introduces Dual Branch VideoMamba with Gated Class Token Fusion, an efficient model combining spatial and temporal features for violence detection, achieving state-of-the-art results on a new comprehensive benchmark.
Contribution
It presents a novel dual-branch architecture with gated fusion and a new benchmark dataset for violence detection, improving accuracy and efficiency.
Findings
Achieves state-of-the-art performance on the new benchmark
Balances accuracy and computational efficiency effectively
Demonstrates the effectiveness of SSMs for real-time surveillance
Abstract
The rapid proliferation of surveillance cameras has increased the demand for automated violence detection. While CNNs and Transformers have shown success in extracting spatio-temporal features, they struggle with long-term dependencies and computational efficiency. We propose Dual Branch VideoMamba with Gated Class Token Fusion (GCTF), an efficient architecture that combines a dual-branch design with a state-space model (SSM) backbone: one branch captures spatial features, while the other focuses on temporal dynamics. The model performs continuous fusion between the branches via a gating mechanism, enhancing its ability to detect violent activities even in challenging surveillance scenarios. We also present a new benchmark for video violence detection that merges the RWF-2000, RLVS, SURV and VioPeru datasets while ensuring strict separation between training and testing sets. Experimental…
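To make the idea of gated fusion between two class tokens concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the gate's form, so the sigmoid gate over the concatenated tokens, the convex combination, and all names (`gated_fusion`, `cls_spatial`, `cls_temporal`, `weights`, `bias`) are assumptions chosen for illustration.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_fusion(cls_spatial, cls_temporal, weights, bias):
    """Fuse two class tokens with an element-wise gate (illustrative assumption).

    cls_spatial, cls_temporal: lists of length D (one class token per branch).
    weights: D rows of length 2*D, applied to the concatenated tokens.
    bias: list of length D.
    Returns a list of length D where each element is
    g * spatial + (1 - g) * temporal, with g = sigmoid(w . [s; t] + b).
    """
    concat = list(cls_spatial) + list(cls_temporal)
    fused = []
    for i, (s, t) in enumerate(zip(cls_spatial, cls_temporal)):
        logit = sum(w * x for w, x in zip(weights[i], concat)) + bias[i]
        g = sigmoid(logit)
        fused.append(g * s + (1.0 - g) * t)
    return fused


# Toy usage with random tokens and weights (hypothetical values).
rng = random.Random(0)
dim = 4
cls_s = [rng.uniform(-1, 1) for _ in range(dim)]
cls_t = [rng.uniform(-1, 1) for _ in range(dim)]
W = [[rng.uniform(-0.5, 0.5) for _ in range(2 * dim)] for _ in range(dim)]
b = [0.0] * dim
fused = gated_fusion(cls_s, cls_t, W, b)
```

Because the gate lies strictly between 0 and 1, each fused element stays between the corresponding spatial and temporal values. In a "continuous fusion" setting, a step like this would presumably be applied repeatedly across layers rather than once at the end, though the abstract does not give those details.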
