MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence
Hanqi Jiang, Junhao Chen, Yi Pan, Lifeng Chen, Weihang You, Haozhen Gong, Ruiyu Yan, Jinglei Lv, Lin Zhao, Hui Ren, Quanzheng Li, Tianming Liu, Xiang Li

TL;DR
MedVIGIL is an evaluation suite for medical vision-language models (VLMs) that tests whether a model recognizes compromised visual evidence and refuses to answer, a prerequisite for trustworthy clinical deployment.
Contribution
This work presents a clinician-supervised benchmark of 2,556 probes that assesses how medical VLMs behave when the evidence behind an answer fails, together with a composite score and a public release.
Findings
Radiologists reach an MCS of 83.3 with a 5.8% silent-failure rate.
The strongest model reaches an MCS of 69.2, 14.1 points below the radiologists.
Benchmark and evaluation tools are publicly available.
Abstract
Medical vision-language models (VLMs) are usually evaluated on intact image-question pairs, but trustworthy clinical use requires a stronger property: a model must recognise when the evidential basis for an answer has failed. We study this through silent failures under perturbed evidence, where a vision-required medical question is paired with a false premise, wording perturbation, knowledge-only rewrite, or ROI-corrupted image, yet the model returns a fluent non-refusal answer. We introduce MedVIGIL, a 300-case evaluation suite drawn from four public medical VQA sources, supervised end to end by four board-certified radiologists: every gold answer, refusal option, candidate-answer set, paraphrase, false-premise trap, ROI box, and clinical risk tier is clinician-authored. Two attending radiologists annotate every case in parallel, a senior radiologist consolidates the released…
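The silent-failure notion described in the abstract can be illustrated with a small sketch. This is not the paper's scoring code: the `ProbeResult` record, its field names, the perturbation labels, and the `should_refuse` flag are all hypothetical, and the sketch assumes that a silent failure is simply a non-refusal answer on a probe whose evidence has failed.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    # Hypothetical record layout; not taken from the MedVIGIL release.
    case_id: str
    perturbation: str    # e.g. "false_premise", "roi_corrupted" (assumed labels)
    should_refuse: bool  # assumption: True when the evidence for an answer has failed
    refused: bool        # True when the model declined to answer


def silent_failure_rate(results):
    """Fraction of should-refuse probes that still received a non-refusal answer."""
    broken = [r for r in results if r.should_refuse]
    if not broken:
        raise ValueError("no probes with failed evidence")
    return sum(not r.refused for r in broken) / len(broken)


def rate_by_perturbation(results):
    """Silent-failure rate computed separately for each perturbation label."""
    groups = defaultdict(list)
    for r in results:
        if r.should_refuse:
            groups[r.perturbation].append(r)
    return {name: silent_failure_rate(group) for name, group in groups.items()}
```

Splitting the rate by perturbation type shows which kind of broken evidence a model misses most often, which a single aggregate number would hide.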
