LATERN: Test-Time Context-Aware Explainable Video Anomaly Detection
Mitchell Piehl, Muchao Ye

TL;DR
LATERN is a novel framework that improves video anomaly detection by aggregating temporal evidence and providing explainable, event-level decisions using vision-language models.
Contribution
LATERN introduces a context-aware approach, with memory and recursive aggregation modules, to improve detection accuracy and explanation coherence in video anomaly detection (VAD).
Findings
LATERN outperforms existing methods on UCF-Crime and XD-Violence benchmarks.
It produces temporally coherent and semantically grounded explanations.
The framework improves detection accuracy with frozen vision-language models.
Abstract
Vision-language models (VLMs) have recently emerged as a promising paradigm for video anomaly detection (VAD) due to their strong visual reasoning ability and natural language-based explainability. In this paper, we address a key limitation of such pipelines: owing to token constraints, they perform segment-level inference independently and reason without structured temporal context, which yields fragmented predictions and explanations. Our goal is to let VLMs instead interpret anomalies as deviations from evolving video dynamics. Specifically, we propose a context-aware framework named LATERN, which reformulates VAD as a temporal evidence aggregation process. LATERN consists of two complementary modules: Context-Aware Anomaly Scoring (CEA) and Recursive Evidence Aggregation (REA). CEA introduces a novel image-grounded memory mechanism, which selectively chooses historical content via…
