PrismVAU: Prompt-Refined Inference System for Multimodal Video Anomaly Understanding
Iñaki Erregue, Kamal Nasrollahi, Sergio Escalera

TL;DR
PrismVAU is a lightweight, real-time system for video anomaly understanding that uses a single off-the-shelf multimodal large language model to score anomalies, explain them, and optimize its own prompts, without costly annotations or external modules.
Contribution
The paper introduces PrismVAU, a novel system that simplifies and accelerates multimodal video anomaly understanding using prompt optimization and a two-stage approach with minimal supervision.
Findings
Achieves competitive detection performance on standard benchmarks
Provides interpretable anomaly explanations
Operates efficiently without external modules or dense processing
Abstract
Video Anomaly Understanding (VAU) extends traditional Video Anomaly Detection (VAD) by not only localizing anomalies but also describing and reasoning about their context. Existing VAU approaches often rely on fine-tuned multimodal large language models (MLLMs) or external modules such as video captioners, which introduce costly annotations, complex training pipelines, and high inference overhead. In this work, we introduce PrismVAU, a lightweight yet effective system for real-time VAU that leverages a single off-the-shelf MLLM for anomaly scoring, explanation, and prompt optimization. PrismVAU operates in two complementary stages: (1) a coarse anomaly scoring module that computes frame-level anomaly scores via similarity to textual anchors, and (2) an MLLM-based refinement module that contextualizes anomalies through system and user prompts. Both textual anchors and prompts are…
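The first stage described in the abstract, frame-level scoring by similarity to textual anchors, can be illustrated with a minimal sketch. The abstract does not say how embeddings are produced, how many anchors are used, or how similarities are aggregated, so everything below is an assumption made for illustration: frame and anchor embeddings are taken as precomputed vectors in a shared space, the anchors are split into hypothetical "normal" and "abnormal" groups, the best-matching anchor in each group is used, and a temperature-scaled softmax turns the two similarities into a score. The function name `coarse_anomaly_scores` and the `temperature` parameter are not from the paper.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def coarse_anomaly_scores(frame_emb, normal_anchors, abnormal_anchors,
                          temperature=0.07):
    """Return one anomaly score in [0, 1] per frame.

    frame_emb:        (num_frames, dim) frame embeddings (assumed precomputed)
    normal_anchors:   (num_normal, dim) embeddings of "normal" text anchors
    abnormal_anchors: (num_abnormal, dim) embeddings of "abnormal" text anchors
    temperature:      hypothetical softmax temperature, not from the paper
    """
    f = l2_normalize(np.asarray(frame_emb, dtype=float))
    n = l2_normalize(np.asarray(normal_anchors, dtype=float))
    a = l2_normalize(np.asarray(abnormal_anchors, dtype=float))

    # Cosine similarity of each frame to its best-matching anchor per group.
    sim_normal = (f @ n.T).max(axis=1)
    sim_abnormal = (f @ a.T).max(axis=1)

    # Two-way softmax over (normal, abnormal); keep the abnormal probability.
    logits = np.stack([sim_normal, sim_abnormal], axis=1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs[:, 1]


if __name__ == "__main__":
    # Toy 2-D embeddings: frame 0 aligns with the normal anchor,
    # frame 1 aligns with the abnormal anchor.
    frames = np.array([[1.0, 0.0], [0.0, 1.0]])
    normal = np.array([[1.0, 0.0]])
    abnormal = np.array([[0.0, 1.0]])
    print(coarse_anomaly_scores(frames, normal, abnormal))
```

In a two-stage design of this kind, frames whose coarse score is high would be the ones passed on to the MLLM-based refinement stage; the selection rule is likewise not given in the abstract.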
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Human Pose and Action Recognition · Video Analysis and Summarization
