Multi-scale Bottleneck Transformer for Weakly Supervised Multimodal Violence Detection
Shengyang Sun, Xiaojin Gong

TL;DR
This paper introduces a multi-scale bottleneck transformer approach for weakly supervised multimodal violence detection. It targets three challenges across modalities (information redundancy, modality imbalance, and modality asynchrony) and reports state-of-the-art results on the XD-Violence dataset.
Contribution
It proposes a multi-scale bottleneck transformer (MSBT) fusion module that uses bottleneck tokens, together with a temporal consistency contrast loss, for improved multimodal violence detection.
Findings
Achieves state-of-the-art performance on the XD-Violence dataset.
Effectively handles modality imbalance and asynchrony.
Outperforms existing methods in weakly supervised settings.
Abstract
Weakly supervised multimodal violence detection aims to learn a violence detection model by leveraging multiple modalities such as RGB, optical flow, and audio, while only video-level annotations are available. In the pursuit of effective multimodal violence detection (MVD), information redundancy, modality imbalance, and modality asynchrony are identified as three key challenges. In this work, we propose a new weakly supervised MVD method that explicitly addresses these challenges. Specifically, we introduce a multi-scale bottleneck transformer (MSBT) based fusion module that employs a reduced number of bottleneck tokens to gradually condense information and fuse each pair of modalities and utilizes a bottleneck token-based weighting scheme to highlight more important fused features. Furthermore, we propose a temporal consistency contrast loss to semantically align pairwise fused…
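The fusion idea described in the abstract, where a small set of bottleneck tokens condenses one modality before passing it to another, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`cross_attention`, `bottleneck_fuse`, `multi_scale_fuse`), the token counts `(8, 4, 2)`, the residual connections, the slicing used to shrink the token set, and the random initialization standing in for learned parameters are all assumptions made for illustration. Learned projections, multiple heads, the bottleneck token-based weighting scheme, and the contrast loss are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Single-head scaled dot-product attention with no learned projections
    # (a simplification; a real transformer layer would project Q, K, V).
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

def bottleneck_fuse(src, dst, bottleneck):
    # Step 1: the few bottleneck tokens condense the source modality.
    condensed = bottleneck + cross_attention(bottleneck, src)
    # Step 2: the destination modality reads only from the condensed tokens,
    # so all cross-modal information flows through the bottleneck.
    fused = dst + cross_attention(dst, condensed)
    return fused, condensed

def multi_scale_fuse(src, dst, token_counts=(8, 4, 2), seed=0):
    # Hypothetical multi-scale schedule: each stage keeps fewer bottleneck
    # tokens than the last, gradually condensing the transferred information.
    rng = np.random.default_rng(seed)
    d = src.shape[-1]
    # Random values stand in for learned bottleneck tokens (assumption).
    bottleneck = rng.standard_normal((token_counts[0], d)) * 0.02
    fused = dst
    for n in token_counts:
        bottleneck = bottleneck[:n]  # shrink the token set (assumed mechanism)
        fused, bottleneck = bottleneck_fuse(src, fused, bottleneck)
    return fused, bottleneck

# Usage: fuse 16 snippet-level audio features into 16 RGB features (dim 32).
rng = np.random.default_rng(1)
audio = rng.standard_normal((16, 32))
rgb = rng.standard_normal((16, 32))
fused, tokens = multi_scale_fuse(audio, rgb)
print(fused.shape, tokens.shape)  # (16, 32) (2, 32)
```

Because the destination features never attend to the source features directly, redundant source content has to compete for a handful of tokens, which is the intuition behind using a bottleneck to limit redundancy. Applying the same step in both directions for each pair of modalities would give pairwise fused features.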
Taxonomy
Topics: Advanced Malware Detection Techniques · Anomaly Detection Techniques and Applications · Network Security and Intrusion Detection
Methods: Attention Is All You Need · Max Pooling · 1x1 Convolution · Residual Connection · Softmax · Pointwise Convolution · Linear Layer · ALIGN · Multi-Head Attention · Bottleneck Transformer Block
