FineVAU: A Novel Human-Aligned Benchmark for Fine-Grained Video Anomaly Understanding
João Pereira, Vasco Lopes, João Neves, David Semedo

TL;DR
This paper introduces FineVAU, a new benchmark for detailed video anomaly understanding, featuring a novel evaluation metric and dataset to better assess human-aligned, fine-grained analysis of unusual video events.
Contribution
The paper proposes a comprehensive benchmark with a new human-aligned evaluation metric and a high-quality dataset to improve fine-grained video anomaly understanding.
Findings
FVScore, the benchmark's proposed evaluation metric, aligns better with human perception than existing metrics.
LVLMs struggle with spatial and temporal details in anomalies.
FineVAU reveals limitations of current models in detailed anomaly comprehension.
Abstract
Video Anomaly Understanding (VAU) is a novel task focused on describing unusual occurrences in videos. Despite growing interest, the evaluation of VAU remains an open challenge. Existing benchmarks rely on n-gram-based metrics (e.g., BLEU, ROUGE-L) or LLM-based evaluation. The former fails to capture the rich, free-form, and visually grounded nature of LVLM responses, while the latter focuses on assessing language quality over factual relevance, often resulting in subjective judgments that are misaligned with human perception. In this work, we address this issue by proposing FineVAU, a new benchmark for VAU that shifts the focus towards rich, fine-grained and domain-specific understanding of anomalous videos. We formulate VAU as a three-fold problem, with the goal of comprehensively understanding key descriptive elements of anomalies in video: events (What), participating entities (Who)…
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Human Pose and Action Recognition · Video Analysis and Summarization
