TL;DR
CAVE introduces a comprehensive benchmark for real-world visual anomalies, enabling evaluation of vision-language models' ability to detect, describe, and reason about anomalies grounded in human cognition.
Contribution
This work presents the first real-world visual anomaly benchmark with detailed annotations, facilitating research on anomaly detection and commonsense reasoning in vision-language models.
Findings
State-of-the-art VLMs perform poorly on anomaly detection tasks.
CAVE provides fine-grained annotations for anomaly understanding.
The benchmark encourages development of more robust anomaly reasoning models.
Abstract
Humans can naturally identify, reason about, and explain anomalies in their environment. In computer vision, this long-standing challenge remains limited to industrial defects or unrealistic, synthetically generated anomalies, failing to capture the richness and unpredictability of real-world anomalies. In this work, we introduce CAVE, the first benchmark of real-world visual anomalies. CAVE supports three open-ended tasks (anomaly description, explanation, and justification) and provides fine-grained annotations for visual grounding and for categorizing anomalies by their visual manifestations, complexity, severity, and commonness. These annotations draw inspiration from cognitive science research on how humans identify and resolve anomalies, providing a comprehensive framework for evaluating Vision-Language Models (VLMs) in detecting and understanding anomalies. We show that…
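To make the annotation structure described in the abstract more concrete, here is a minimal Python sketch of what one annotated anomaly record might look like. It is an illustration under stated assumptions, not the benchmark's actual schema: the class name, field names, bounding-box format, and the example values are all hypothetical. Only the three task names and the four categorization axes come from the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple

# The three open-ended tasks named in the abstract.
TASKS = ("description", "explanation", "justification")


@dataclass
class AnomalyAnnotation:
    """Hypothetical record for one annotated anomaly (field names are assumed)."""
    description: str      # what the anomaly is
    explanation: str      # why it counts as anomalous
    justification: str    # a plausible reason it occurred
    # Visual grounding; the (x_min, y_min, x_max, y_max) box format is an assumption.
    bbox: Tuple[float, float, float, float]
    # Categorization axes listed in the abstract; the label values are assumed.
    manifestation: str
    complexity: str
    severity: str
    commonness: str


def missing_task_fields(ann: AnomalyAnnotation) -> List[str]:
    """Return the task names whose free-text annotation is empty."""
    return [task for task in TASKS if not getattr(ann, task).strip()]


# Made-up example record, for illustration only.
example = AnomalyAnnotation(
    description="A bicycle is parked on a rooftop.",
    explanation="Bicycles are normally kept at ground level.",
    justification="",
    bbox=(10.0, 20.0, 110.0, 220.0),
    manifestation="placement",
    complexity="low",
    severity="low",
    commonness="rare",
)
```

A helper such as `missing_task_fields(example)` could be used to check that a record covers all three tasks before it is passed to an evaluation script.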