Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition
Dasol Choi, Seunghyun Lee, Youngsook Song

TL;DR
This paper evaluates vision-language models (VLMs) in safety-critical scenarios and finds a prevalent overreaction bias: models often misclassify safe situations as emergencies, which points to a need for better contextual understanding.
Contribution
Introduces VERI, a benchmark for assessing VLMs' reliability in emergency recognition, and uncovers systematic overreaction issues affecting safety-critical applications.
Findings
Models achieve high recall (70-100%) but low precision in emergency detection, misclassifying 31-96% of safe situations as dangerous.
Seven safe scenarios were misclassified as dangerous by every model.
Overinterpretation of context causes 88-98% of errors.
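One way to identify safe scenarios that every model flags as dangerous is to intersect the per-model sets of false alarms. The sketch below is illustrative only: the function name, model names, and scene IDs are hypothetical and do not come from the paper or the VERI benchmark.

```python
def universally_misclassified(false_alarms_by_model):
    """Return the IDs of safe scenes that every model flagged as an emergency.

    false_alarms_by_model maps a model name to the IDs of safe scenes
    that the model wrongly labeled as dangerous (hypothetical structure).
    """
    flagged_sets = [set(ids) for ids in false_alarms_by_model.values()]
    if not flagged_sets:
        return set()
    # A scene counts only if it appears in every model's false-alarm set.
    return set.intersection(*flagged_sets)


# Toy, made-up data; not results from the paper.
false_alarms = {
    "model_a": ["safe_01", "safe_02", "safe_05"],
    "model_b": ["safe_02", "safe_05"],
    "model_c": ["safe_02", "safe_03", "safe_05"],
}

shared = universally_misclassified(false_alarms)
print(sorted(shared))
```

Scenes that survive the intersection are natural candidates for closer error analysis, since no model in the pool handled them correctly.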
Abstract
Vision-Language Models (VLMs) have shown capabilities in interpreting visual content, but their reliability in safety-critical scenarios remains insufficiently explored. We introduce VERI, a diagnostic benchmark comprising 200 synthetic images (100 contrastive pairs) and an additional 50 real-world images (25 pairs) for validation. Each emergency scene is paired with a visually similar but safe counterpart through human verification. Using a two-stage evaluation protocol (risk identification and emergency response), we assess 17 VLMs across medical emergencies, accidents, and natural disasters. Our analysis reveals an "overreaction problem": models achieve high recall (70-100%) but suffer from low precision, misclassifying 31-96% of safe situations as dangerous. Seven safe scenarios were universally misclassified by all models. This "better-safe-than-sorry" bias stems from contextual…
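To make the recall and precision figures concrete, the following minimal sketch computes recall, precision, and the share of safe images flagged as dangerous from binary labels and predictions. It assumes each image has a ground-truth label (True for emergency, False for safe) and a single binary model prediction; the names `Metrics` and `emergency_metrics` and the toy data are hypothetical, and this is not the paper's evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    precision: float
    recall: float
    false_alarm_rate: float  # share of safe images flagged as emergencies


def emergency_metrics(labels, predictions):
    """Compute metrics where True means 'emergency' and False means 'safe'."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    tn = sum(1 for y, p in pairs if not y and not p)
    # Guard against empty denominators so the function never divides by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    false_alarm_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return Metrics(precision, recall, false_alarm_rate)


# Toy example: four contrastive pairs (emergency image, safe counterpart).
labels = [True, False, True, False, True, False, True, False]
# A model that catches every emergency but also flags three of the four safe images.
predictions = [True, True, True, True, True, True, True, False]

m = emergency_metrics(labels, predictions)
print(f"recall={m.recall:.2f} precision={m.precision:.2f} "
      f"false_alarm_rate={m.false_alarm_rate:.2f}")
```

In this toy case recall is perfect while precision is only 4/7, which is the pattern the abstract calls the overreaction problem: because the benchmark is balanced by construction, every false alarm on a safe counterpart directly lowers precision.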
Taxonomy
Topics: Multimodal Machine Learning Applications
