Enhancing Weakly Supervised Multimodal Video Anomaly Detection through Text Guidance
Shengyang Sun, Jiashen Hua, Junyi Feng, Xiaojin Gong

TL;DR
This paper introduces a novel text-guided framework for weakly supervised multimodal video anomaly detection, leveraging in-context learning for text augmentation and a multi-scale Transformer fusion to improve anomaly characterization and reduce false alarms.
Contribution
The paper proposes a new multi-stage text augmentation method and a multi-scale bottleneck Transformer fusion module to enhance multimodal anomaly detection performance.
Findings
Achieves state-of-the-art results on the UCF-Crime dataset.
Effectively reduces redundancy and imbalance in multimodal fusion.
Improves anomaly detection accuracy with enhanced text features.
Abstract
Weakly supervised multimodal video anomaly detection has gained significant attention, yet the potential of the text modality remains under-explored. Text provides explicit semantic information that can enhance anomaly characterization and reduce false alarms. However, extracting effective text features is challenging due to the inability of general-purpose language models to capture anomaly-specific nuances and the scarcity of relevant descriptions. Furthermore, multimodal fusion often suffers from redundancy and imbalance. To address these issues, we propose a novel text-guided framework. First, we introduce an in-context learning-based multi-stage text augmentation mechanism to generate high-quality anomaly text samples for fine-tuning the text feature extractor. Second, we design a multi-scale bottleneck Transformer fusion module that uses compressed bottleneck tokens to…
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Video Analysis and Summarization · Human Pose and Action Recognition
