MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage
Ufaq Khan, Umair Nawaz, L D M S S Teja, Numaan Saeed, Muhammad Bilal, Yutong Xie, Mohammad Yaqub, and Muhammad Haris Khan

TL;DR
MedObvious is a benchmark that tests whether medical vision-language models perform pre-diagnostic sanity checks on their inputs. It shows that current models are unreliable at verifying that an input is consistent and valid to read.
Contribution
The paper presents MedObvious, a novel benchmark with 1,880 tasks for assessing input validation in medical VLMs, highlighting the need for safety-critical verification capabilities.
Findings
Models often hallucinate anomalies on normal inputs.
Performance drops with larger image sets.
Accuracy varies between multiple-choice and open-ended formats.
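The three findings correspond to simple breakdowns of per-task results. The following is a minimal sketch of how such breakdowns could be computed, not the paper's evaluation code: the `Record` fields, the format labels `"mcq"` and `"open"`, and the toy records are all hypothetical and carry no reported numbers.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One evaluated task (hypothetical schema, not the benchmark's)."""
    set_size: int       # number of panels shown to the model
    fmt: str            # "mcq" or "open" (assumed labels)
    has_anomaly: bool   # ground truth: does any panel violate coherence?
    flagged: bool       # model output: did it claim an anomaly?


def false_alarm_rate(records):
    """Fraction of all-normal inputs on which the model reported an anomaly."""
    normal = [r for r in records if not r.has_anomaly]
    if not normal:
        return None
    return sum(r.flagged for r in normal) / len(normal)


def accuracy_by(records, key):
    """Detection accuracy grouped by an arbitrary key (set size, format, ...)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        k = key(r)
        totals[k] += 1
        hits[k] += int(r.flagged == r.has_anomaly)
    return {k: hits[k] / totals[k] for k in totals}


# Toy records for illustration only; these are not results from the paper.
records = [
    Record(4, "mcq", False, False),
    Record(4, "mcq", True, True),
    Record(9, "open", False, True),
    Record(9, "open", True, False),
]

print(false_alarm_rate(records))
print(accuracy_by(records, lambda r: r.set_size))
print(accuracy_by(records, lambda r: r.fmt))
```

Reporting the false-alarm rate separately from overall accuracy keeps hallucinated anomalies on normal inputs visible instead of averaging them away.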
Abstract
Vision Language Models (VLMs) are increasingly used for tasks like medical report generation and visual question answering. However, fluent diagnostic text does not guarantee safe visual understanding. In clinical practice, interpretation begins with pre-diagnostic sanity checks: verifying that the input is valid to read (correct modality and anatomy, plausible viewpoint and orientation, and no obvious integrity violations). Existing benchmarks largely assume this step is solved, and therefore miss a critical failure mode: a model can produce plausible narratives even when the input is inconsistent or invalid. We introduce MedObvious, a 1,880-task benchmark that isolates input validation as a set-level consistency capability over small multi-panel image sets: the model must identify whether any panel violates expected coherence. MedObvious spans five progressive tiers, from basic…
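The set-level task described in the abstract can be pictured as a small data structure plus a grading rule. This is an illustrative sketch under stated assumptions, not the benchmark's actual schema: `PanelSetTask`, `grade`, the file names, and the idea that an answer names a panel index are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PanelSetTask:
    """A small multi-panel image set (hypothetical representation)."""
    panels: Sequence[str]       # placeholder image identifiers
    odd_panel: Optional[int]    # index of the violating panel, None if all consistent


def grade(task: PanelSetTask, answer: Optional[int]) -> str:
    """Compare a model's answer (panel index, or None for 'all consistent')."""
    if task.odd_panel is None:
        # Any reported anomaly on a fully consistent set is a false alarm.
        return "correct" if answer is None else "false_alarm"
    if answer is None:
        return "miss"
    return "correct" if answer == task.odd_panel else "wrong_panel"


# Assumed examples: one fully consistent set, one set with a violating panel.
consistent = PanelSetTask(panels=["a.png", "b.png", "c.png", "d.png"], odd_panel=None)
violated = PanelSetTask(panels=["a.png", "b.png", "x.png", "d.png"], odd_panel=2)

print(grade(consistent, 1))
print(grade(violated, 2))
```

Including fully consistent sets, where the right answer is that nothing is wrong, is what lets a grader tell a genuine detection apart from a model that always reports an anomaly.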
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
