CAVE: Detecting and Explaining Commonsense Anomalies in Visual Environments

Rishika Bhagwatkar; Syrielle Montariol; Angelika Romanou; Beatriz Borges; Irina Rish; Antoine Bosselut

arXiv:2510.26006 · cs.CV · October 31, 2025

TL;DR

CAVE is a benchmark of real-world visual anomalies for evaluating how well vision-language models detect, describe, and reason about anomalies, with annotations inspired by cognitive science research on how humans identify and resolve anomalies.

Contribution

This work presents the first real-world visual anomaly benchmark with detailed annotations, facilitating research on anomaly detection and commonsense reasoning in vision-language models.

Findings

01. State-of-the-art VLMs perform poorly on anomaly detection tasks.

02. CAVE provides fine-grained annotations for anomaly understanding.

03. The benchmark encourages the development of more robust anomaly reasoning models.

Abstract

Humans can naturally identify, reason about, and explain anomalies in their environment. In computer vision, this long-standing challenge remains limited to industrial defects or unrealistic, synthetically generated anomalies, failing to capture the richness and unpredictability of real-world anomalies. In this work, we introduce CAVE, the first benchmark of real-world visual anomalies. CAVE supports three open-ended tasks (anomaly description, explanation, and justification), with fine-grained annotations for visual grounding and for categorizing anomalies based on their visual manifestations, complexity, severity, and commonness. These annotations draw inspiration from cognitive science research on how humans identify and resolve anomalies, providing a comprehensive framework for evaluating Vision-Language Models (VLMs) in detecting and understanding anomalies. We show that…
