Are Large Vision Language Models Truly Grounded in Medical Images? Evidence from Italian Clinical Visual Question Answering

Federico Felizzi; Olivia Riccomi; Michele Ferramola; Francesco Andrea Causio; Manuel Del Medico; Vittorio De Vita; Lorenzo De Mori; Alessandra Piscitelli; Pietro Eric Risuleo; Bianca Destro Castaniti; Antonio Cristiano; Alessia Longo; Luigi De Angelis; Mariapia Vassalli; Marcello Di Pumpo

arXiv:2511.19220 · cs.CV · December 1, 2025



TL;DR

This study evaluates whether large vision language models genuinely understand medical images in Italian clinical question answering, revealing significant variability in visual grounding and robustness among models.

Contribution

The paper provides the first systematic assessment of visual grounding in four frontier VLMs on Italian medical questions, using a diagnostic test that replaces the correct medical image with a blank placeholder.

Findings

01. GPT-4o shows the strongest visual grounding: its accuracy falls 27.9 percentage points (83.2% to 55.3%) when the correct images are replaced with blank placeholders.

02. GPT-5-mini, Gemini, and Claude maintain high accuracy, with only modest drops under the same substitution.

03. Models often generate confident explanations for fabricated visual content.

Abstract

Large vision language models (VLMs) have achieved impressive performance on medical visual question answering benchmarks, yet their reliance on visual information remains unclear. We investigate whether frontier VLMs demonstrate genuine visual grounding when answering Italian medical questions by testing four state-of-the-art models: Claude Sonnet 4.5, GPT-4o, GPT-5-mini, and Gemini 2.0 flash exp. Using 60 questions from the EuropeMedQA Italian dataset that explicitly require image interpretation, we substitute correct medical images with blank placeholders to test whether models truly integrate visual and textual information. Our results reveal striking variability in visual dependency: GPT-4o shows the strongest visual grounding with a 27.9pp accuracy drop (83.2% [74.6%, 91.7%] to 55.3% [44.1%, 66.6%]), while GPT-5-mini, Gemini, and Claude maintain high accuracy with modest drops of…
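The placeholder test described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' evaluation harness: `Item`, `answer_fn`, `accuracy`, and `grounding_drop` are hypothetical names, the model call is abstracted as any function that maps a question and image bytes to an answer label, and scoring is assumed to be exact match on that label.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Item:
    """One image-dependent question (hypothetical structure)."""
    question: str
    image: bytes
    answer: str  # gold answer label, e.g. "A"


# Hypothetical model interface: (question, image bytes) -> predicted label.
AnswerFn = Callable[[str, bytes], str]


def accuracy(items: Sequence[Item], answer_fn: AnswerFn,
             placeholder: Optional[bytes] = None) -> float:
    """Accuracy in percent. If `placeholder` is given, it replaces every image."""
    correct = 0
    for item in items:
        image = item.image if placeholder is None else placeholder
        if answer_fn(item.question, image) == item.answer:
            correct += 1
    return 100.0 * correct / len(items)


def grounding_drop(items: Sequence[Item], answer_fn: AnswerFn,
                   placeholder: bytes) -> float:
    """Accuracy drop in percentage points when images are swapped for a placeholder."""
    return accuracy(items, answer_fn) - accuracy(items, answer_fn, placeholder)


# The GPT-4o figures quoted in the abstract, as plain arithmetic:
reported_drop = round(83.2 - 55.3, 1)  # 27.9 percentage points
```

A larger drop suggests the model depends more on the image; a drop near zero suggests the questions can be answered from text alone or that the model is not using the image.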


Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)