Enhancing Weakly Supervised Multimodal Video Anomaly Detection through Text Guidance

Shengyang Sun; Jiashen Hua; Junyi Feng; Xiaojin Gong

arXiv:2602.10549 · cs.CV · February 12, 2026

Open Access

TL;DR

This paper introduces a text-guided framework for weakly supervised multimodal video anomaly detection. It uses in-context learning for text augmentation and a multi-scale bottleneck Transformer fusion module to improve anomaly characterization and reduce false alarms.

Contribution

It proposes a new multi-stage text augmentation method and a multi-scale bottleneck Transformer fusion module to enhance multimodal anomaly detection performance.

Findings

01

Achieves state-of-the-art results on the UCF-Crime dataset.

02

Effectively reduces redundancy and imbalance in multimodal fusion.

03

Improves anomaly detection accuracy with enhanced text features.

Abstract

Weakly supervised multimodal video anomaly detection has gained significant attention, yet the potential of the text modality remains under-explored. Text provides explicit semantic information that can enhance anomaly characterization and reduce false alarms. However, extracting effective text features is challenging due to the inability of general-purpose language models to capture anomaly-specific nuances and the scarcity of relevant descriptions. Furthermore, multimodal fusion often suffers from redundancy and imbalance. To address these issues, we propose a novel text-guided framework. First, we introduce an in-context learning-based multi-stage text augmentation mechanism to generate high-quality anomaly text samples for fine-tuning the text feature extractor. Second, we design a multi-scale bottleneck Transformer fusion module that uses compressed bottleneck tokens to…
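As a rough illustration of fusion through bottleneck tokens, the minimal Python sketch below lets video and text tokens exchange information only via a small shared set of tokens. It is not the paper's module: it is single-scale, uses plain dot-product attention with no learned projections, and the names `attend` and `bottleneck_fusion`, the averaging step, and the residual update are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections.

    Each query returns a weighted average of `keys_values`, which serve as
    both keys and values (a simplification assumed for this sketch).
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ])
    return outputs


def bottleneck_fusion(video_tokens, text_tokens, bottleneck_tokens):
    """Hypothetical one-layer fusion where modalities meet only at the bottleneck."""
    # Step 1: the few bottleneck tokens summarize each modality separately.
    from_video = attend(bottleneck_tokens, video_tokens)
    from_text = attend(bottleneck_tokens, text_tokens)
    # Averaging the two summaries is an assumption of this sketch.
    fused = [
        [(a + b) / 2 for a, b in zip(v, t)]
        for v, t in zip(from_video, from_text)
    ]
    # Step 2: each modality reads back from the compressed summary only,
    # so it never attends directly to the other modality's tokens.
    video_ctx = attend(video_tokens, fused)
    text_ctx = attend(text_tokens, fused)
    video_out = [
        [x + c for x, c in zip(tok, ctx)]
        for tok, ctx in zip(video_tokens, video_ctx)
    ]
    text_out = [
        [x + c for x, c in zip(tok, ctx)]
        for tok, ctx in zip(text_tokens, text_ctx)
    ]
    return video_out, text_out, fused


# Toy inputs: three video tokens, one text token, one bottleneck token.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[0.5, 0.5]]
bottleneck = [[0.1, 0.2]]
video_out, text_out, fused = bottleneck_fusion(video, text, bottleneck)
```

Because cross-modal information must pass through the small `fused` set, the cost of the exchange scales with the number of bottleneck tokens rather than with the product of the two sequence lengths, which is the usual motivation for this kind of design.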

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Video Analysis and Summarization · Human Pose and Action Recognition