Reinforcement Learning-based Mixture of Vision Transformers for Video Violence Recognition

Hamid Mohammadi; Ehsan Nazerfard; Tahereh Firoozi

arXiv:2310.03108 · cs.CV · October 6, 2023


Open Access

TL;DR

This paper proposes a reinforcement learning-based mixture of vision transformers for video violence recognition. The system outperforms CNN-based models in accuracy, and its learned router reduces the computational cost of using large vision transformers.

Contribution

It introduces a novel transformer-based Mixture of Experts system that combines large and efficient transformers with reinforcement learning routing for improved accuracy and efficiency.

Findings

01. Achieves 92.4% accuracy on the RWF dataset.
02. Outperforms CNN-based models in accuracy.
03. Reduces computational costs through reinforcement learning routing.

Abstract

Video violence recognition based on deep learning concerns accurate yet scalable human violence recognition. Currently, most state-of-the-art video violence recognition studies use CNN-based models to represent and categorize videos. However, recent studies suggest that pre-trained transformers are more accurate than CNN-based models on various video analysis benchmarks. Yet these models are not thoroughly evaluated for video violence recognition. This paper introduces a novel transformer-based Mixture of Experts (MoE) video violence recognition system. Through an intelligent combination of large vision transformers and efficient transformer architectures, the proposed system not only takes advantage of the vision transformer architecture but also reduces the cost of utilizing large vision transformers. The proposed architecture maximizes violence recognition system accuracy while…
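The routing idea in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: the visible text does not describe the router's inputs, reward, or training procedure, so everything here is an assumption made for illustration. It treats routing as a two-action bandit (send a clip to a small, cheap expert or to a large, costly one), rewards a correct prediction with 1, subtracts a hypothetical `cost_penalty` whenever the large expert is used, and updates a logistic policy with the exact policy gradient over the two actions. The `difficulty` feature and the per-clip correctness flags are invented stand-ins for whatever signal a real router would see.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_router(clips, cost_penalty=0.2, lr=1.0, epochs=500):
    """Fit p(use large expert) = sigmoid(w * difficulty + b) by policy gradient.

    clips: list of (difficulty, small_correct, large_correct) tuples. These are
    hypothetical per-clip records stating whether each expert would classify
    the clip correctly; they are not taken from the paper.
    """
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for difficulty, small_ok, large_ok in clips:
            p_large = sigmoid(w * difficulty + b)
            # Assumed reward: 1 if correct, minus a penalty for the costly expert.
            reward_small = float(small_ok)
            reward_large = float(large_ok) - cost_penalty
            # Exact gradient of the expected reward w.r.t. the routing logit.
            g = p_large * (1.0 - p_large) * (reward_large - reward_small)
            grad_w += g * difficulty
            grad_b += g
        # Gradient ascent on expected reward.
        w += lr * grad_w
        b += lr * grad_b
    return w, b


def route(w, b, difficulty):
    """Pick the expert the trained policy prefers for a clip."""
    return "large" if sigmoid(w * difficulty + b) > 0.5 else "small"


# Toy data: an easy clip both experts get right, and a hard clip that only
# the large expert gets right.
clips = [(-1.0, True, True), (1.0, False, True)]
w, b = train_router(clips)
print(route(w, b, -1.0), route(w, b, 1.0))
```

Under these assumptions the policy learns to keep easy clips on the small expert, where the large one adds cost but no accuracy, and to escalate hard clips to the large expert. Raising `cost_penalty` shifts that trade-off toward cheaper inference.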

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Video Surveillance and Tracking Methods

Methods: Attention Is All You Need · Softmax · Linear Layer · Residual Connection · Multi-Head Attention · Layer Normalization · Dense Connections · Vision Transformer