VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models
Ying Cheng, Yu-Ho Lin, Min-Hung Chen, Fu-En Yang, Shang-Hong Lai

TL;DR
VADER introduces a relation-aware large language model framework for video anomaly understanding, integrating object relations and causal reasoning to produce detailed, explainable descriptions of anomalous events in videos.
Contribution
The paper presents VADER, a novel LLM-driven framework that incorporates object relations and causal context for enhanced video anomaly understanding and explanation.
Findings
VADER outperforms existing methods on multiple VAU benchmarks.
It provides detailed, causally grounded descriptions of anomalies.
VADER effectively supports anomaly explanation and question answering.
Abstract
Video anomaly understanding (VAU) aims to provide detailed interpretation and semantic comprehension of anomalous events within videos, addressing limitations of traditional methods that focus solely on detecting and localizing anomalies. However, existing approaches often neglect the deeper causal relationships and interactions between objects, which are critical for understanding anomalous behaviors. In this paper, we propose VADER, an LLM-driven framework for Video Anomaly unDErstanding, which integrates keyframe object Relation features with visual cues to enhance anomaly comprehension from video. Specifically, VADER first applies an Anomaly Scorer to assign per-frame anomaly scores, followed by a Context-AwarE Sampling (CAES) strategy to capture the causal context of each anomalous event. A Relation Feature Extractor and a COntrastive Relation Encoder (CORE) jointly model dynamic…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAnomaly Detection Techniques and Applications · Human Pose and Action Recognition · Multimodal Machine Learning Applications
