Reasoning-Guided Grounding: Elevating Video Anomaly Detection through Multimodal Large Language Models
Sakshi Agarwal, Aishik Konwer, Ankit Parag Shah

TL;DR
VANGUARD unifies anomaly detection, spatial grounding, and chain-of-thought reasoning for video analysis within a single multimodal large language model, reporting 94% ROC-AUC and 84% F1 on UCF-Crime along with interpretable explanations.
Contribution
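The paper presents a three-stage training curriculum (classifier warmup on frozen backbone features, LoRA-adapted spatial grounding, then chain-of-thought generation) and a teacher-student annotation pipeline that add reasoning and grounding capabilities to video anomaly detection.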
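The abstract names the three curriculum stages as classifier warmup on frozen backbone features, LoRA-adapted spatial grounding, and chain-of-thought generation, with objectives layered progressively. A minimal sketch of how such a schedule could be written down follows. The parameter-group names (`backbone`, `lora_adapters`, `classifier_head`), the objective labels, and the choice of which groups stay trainable in stages 2 and 3 are all assumptions made for illustration; the summary does not specify them.

```python
from dataclasses import dataclass

# Hypothetical parameter-group labels; the source does not name the model's modules.
PARAM_GROUPS = ("backbone", "lora_adapters", "classifier_head")


@dataclass(frozen=True)
class Stage:
    name: str
    objectives: tuple      # objectives active in this stage (labels are illustrative)
    trainable: frozenset   # parameter groups that receive gradient updates


# Objectives accumulate from stage to stage, following the abstract's
# "progressively layers training objectives". Which groups are trainable in
# stages 2 and 3 is an assumption; only the frozen backbone in stage 1 and
# the use of LoRA in stage 2 come from the abstract.
CURRICULUM = (
    Stage("classifier_warmup", ("classification",),
          frozenset({"classifier_head"})),
    Stage("lora_grounding", ("classification", "grounding"),
          frozenset({"classifier_head", "lora_adapters"})),
    Stage("cot_generation", ("classification", "grounding", "reasoning"),
          frozenset({"classifier_head", "lora_adapters"})),
)


def trainable_flags(stage):
    """Map every parameter group to True (updated) or False (frozen) for a stage."""
    return {group: group in stage.trainable for group in PARAM_GROUPS}


for stage in CURRICULUM:
    print(stage.name, stage.objectives, trainable_flags(stage))
```

In a real training loop these flags would typically drive `requires_grad` on the corresponding modules; the base backbone weights stay frozen throughout, as is usual when LoRA adapters carry the updates.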
Findings
Achieves 94% ROC-AUC and 84% F1 on UCF-Crime.
Produces interpretable chain-of-thought explanations and spatial grounding.
Demonstrates strong zero-shot transfer to other datasets.
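For readers unfamiliar with the two reported metrics, the sketch below computes ROC-AUC and F1 from anomaly scores in plain Python. The labels, scores, and the 0.5 decision threshold are toy values invented for illustration, not data from the paper, and the summary does not say whether the reported numbers are computed per frame or per video.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_score(labels, scores, threshold=0.5):
    """F1 = 2*TP / (2*TP + FP + FN) after thresholding the scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example: 1 = anomalous, 0 = normal (values are made up).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

print(roc_auc(toy_labels, toy_scores))   # 0.75
print(f1_score(toy_labels, toy_scores))  # 2/3, about 0.667
```

ROC-AUC is threshold-free and measures ranking quality, while F1 depends on the chosen threshold, which is why the two numbers can differ noticeably on the same model.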
Abstract
Video Anomaly Detection (VAD) has traditionally been framed as binary classification or outlier detection, providing neither interpretable reasoning nor precise spatial localization of anomalous events. While Vision-Language Models (VLMs) offer rich scene understanding, they struggle with reliable spatial grounding, often producing hallucinated or geometrically invalid bounding boxes when asked to localize objects. We propose VANGUARD (Video Anomaly Understanding through Reasoning and Grounding), a framework that unifies anomaly classification, spatial grounding, and chain-of-thought reasoning within a single VLM. VANGUARD introduces a three-stage curriculum that progressively layers training objectives: (1) classifier warmup on frozen backbone features, (2) LoRA-adapted spatial grounding, and (3) chain-of-thought generation. To overcome the sparse annotation typical of VAD benchmarks,…
