Reinforcement Learning-based Mixture of Vision Transformers for Video Violence Recognition
Hamid Mohammadi, Ehsan Nazerfard, Tahereh Firoozi

TL;DR
This paper proposes a reinforcement learning-based mixture of vision transformers for video violence recognition, achieving higher accuracy than CNN-based models while reducing the computational cost of using large vision transformers.
Contribution
It introduces a novel transformer-based Mixture of Experts system that combines large and efficient transformers with reinforcement learning routing for improved accuracy and efficiency.
Findings
Achieves 92.4% accuracy on the RWF dataset.
Outperforms CNN-based models in accuracy.
Reduces computational costs through reinforcement learning routing.
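The routing idea in the findings above can be illustrated with a toy sketch. This is not the authors' implementation: the summary does not describe the paper's router, experts, or reward, so everything here is an assumption made for illustration. The two stub experts, the scalar "difficulty" feature, the cost penalties, and the REINFORCE update with a running-average baseline are all hypothetical stand-ins. The sketch only shows how a learned router can send easy inputs to a cheap model and reserve the expensive model for hard ones.

```python
import math
import random

# Hypothetical stand-ins for the two experts. A real system would run an
# efficient transformer and a large vision transformer on a video clip.
def efficient_expert(difficulty):
    return difficulty < 0.5   # assumed: correct only on easy clips

def large_expert(difficulty):
    return True               # assumed: always correct, but costly

# Assumed per-call cost penalty: action 0 = efficient, action 1 = large.
COST = {0: 0.0, 1: 0.3}

class Router:
    """Bernoulli-logistic policy over one scalar feature (illustrative only)."""

    def __init__(self, lr=0.2):
        self.w = 0.0
        self.b = 0.0
        self.lr = lr

    def prob_large(self, x):
        # Clip the logit so math.exp cannot overflow.
        z = max(-30.0, min(30.0, self.w * x + self.b))
        return 1.0 / (1.0 + math.exp(-z))

    def act(self, x, rng):
        return 1 if rng.random() < self.prob_large(x) else 0

    def update(self, x, action, advantage):
        # REINFORCE: d/d(logit) of log pi(action | x) is (action - p).
        g = action - self.prob_large(x)
        self.w += self.lr * advantage * g * x
        self.b += self.lr * advantage * g

def train(steps=20000, seed=0):
    rng = random.Random(seed)
    router = Router()
    baseline = 0.0  # running average of reward, used to reduce variance
    for _ in range(steps):
        x = rng.random()                       # hypothetical clip difficulty
        action = router.act(x, rng)
        expert = large_expert if action == 1 else efficient_expert
        reward = (1.0 if expert(x) else 0.0) - COST[action]
        router.update(x, action, reward - baseline)
        baseline += 0.05 * (reward - baseline)
    return router

if __name__ == "__main__":
    router = train()
    for x in (0.1, 0.5, 0.9):
        print(f"difficulty={x:.1f}  P(route to large)={router.prob_large(x):.3f}")
```

In this toy setup the reward trades accuracy against cost, so the router is pushed toward the large expert only where the efficient expert tends to fail. The size of the cost penalty controls how aggressively compute is saved.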
Abstract
Video violence recognition based on deep learning concerns accurate yet scalable human violence recognition. Currently, most state-of-the-art video violence recognition studies use CNN-based models to represent and categorize videos. However, recent studies suggest that pre-trained transformers are more accurate than CNN-based models on various video analysis benchmarks. Yet these models are not thoroughly evaluated for video violence recognition. This paper introduces a novel transformer-based Mixture of Experts (MoE) video violence recognition system. Through an intelligent combination of large vision transformers and efficient transformer architectures, the proposed system not only takes advantage of the vision transformer architecture but also reduces the cost of utilizing large vision transformers. The proposed architecture maximizes violence recognition system accuracy while…
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Video Surveillance and Tracking Methods
Methods: Attention Is All You Need · Softmax · Linear Layer · Residual Connection · Multi-Head Attention · Layer Normalization · Dense Connections · Vision Transformer
