Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection
Jiashuo Yu, Jinyu Liu, Ying Cheng, Rui Feng, Yuejie Zhang

TL;DR
This paper introduces a modality-aware contrastive instance learning approach with self-distillation for weakly-supervised audio-visual violence detection, targeting modality heterogeneity to improve detection accuracy.
Contribution
It proposes MACIL-SD, a new framework that clusters unimodal instances, uses contrastive learning, and applies self-distillation to enhance weakly-supervised audio-visual violence detection.
Findings
Outperforms previous methods on the XD-Violence dataset
Reduces model complexity while improving accuracy
Can serve as a plug-in to enhance other networks
Abstract
Weakly-supervised audio-visual violence detection aims to identify snippets containing multimodal violence events using only video-level labels. Many prior works perform audio-visual integration and interaction in an early or intermediate manner, yet overlook modality heterogeneity in the weakly-supervised setting. In this paper, we analyze the modality asynchrony and undifferentiated instances phenomena of the multiple instance learning (MIL) procedure, and further investigate their negative impact on weakly-supervised audio-visual learning. To address these issues, we propose a modality-aware contrastive instance learning with self-distillation (MACIL-SD) strategy. Specifically, we leverage a lightweight two-stream network to generate audio and visual bags, in which unimodal background, violent, and normal instances are clustered into semi-bags in an unsupervised way. Then…
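As a rough illustration of the MIL setting the abstract refers to, the sketch below shows one common way to train snippet-level scores from a video-level label: average the top-k snippet scores into a single video score and apply binary cross-entropy against the video label. This is a generic MIL baseline, not the MACIL-SD method; the function names, the top-k pooling choice, and the example scores are all assumptions made for illustration.

```python
import math
from typing import List


def video_score_topk(snippet_scores: List[float], k: int) -> float:
    """Pool snippet-level scores into one video-level score.

    Hypothetical helper: averages the k highest snippet scores, a common
    MIL pooling choice (assumed here, not taken from the paper).
    """
    if not snippet_scores:
        raise ValueError("snippet_scores must not be empty")
    k = max(1, min(k, len(snippet_scores)))  # clamp k to a valid range
    top = sorted(snippet_scores, reverse=True)[:k]
    return sum(top) / k


def binary_cross_entropy(p: float, y: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy between a predicted probability p and a label y."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


# Illustrative snippet scores for one video (made-up values).
scores = [0.1, 0.9, 0.8, 0.2]
video_label = 1.0  # the only supervision available: the video is violent

video_score = video_score_topk(scores, k=2)
loss = binary_cross_entropy(video_score, video_label)
```

Because only the pooled score is supervised, the label never says which snippets are violent or in which modality the event occurs, which is the kind of ambiguity the abstract attributes to the MIL procedure.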
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Digital Media Forensic Detection · Music and Audio Processing
