SurgCheck: Do Vision-Language Models Really Look at Images in Surgical VQA?
Jongmin Shin, Ka Young Kim, Eunki Cho, Seong Tae Kim, Namkee Oh

TL;DR
SurgCheck is a diagnostic benchmark showing that vision-language models in surgical VQA rely heavily on linguistic shortcuts rather than visual understanding, which exposes the limits of current performance metrics.
Contribution
The paper introduces SurgCheck, a novel paired-question benchmark with grounding cues to quantify linguistic shortcut reliance in surgical VQA models.
Findings
Model performance drops on the less-biased questions, which remove entity names from the question text, indicating reliance on linguistic shortcuts.
Text-only ablations suggest visual reasoning is minimal for action and target prediction.
SurgCheck shows that high benchmark scores may not reflect true visual understanding.
Abstract
Purpose: Vision-language models (VLMs) have shown promising performance in surgical visual question answering (VQA). However, existing surgical VQA datasets often contain linguistic shortcuts, where question phrasing implicitly constrains the answer space. It remains unclear whether reported performance reflects visual understanding or reliance on such linguistic shortcuts.
Methods: We introduce SurgCheck, a diagnostic benchmark for quantifying linguistic shortcut reliance in surgical VQA. SurgCheck employs a paired-question design in which each surgical frame is associated with an original question containing entity names and a less-biased counterpart that removes these names while preserving identical visual content and ground-truth answers. The resulting performance gap provides a diagnostic signal of shortcut reliance. To ensure that the less-biased question remains well-defined…
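The paired-question gap described in the abstract can be illustrated with a minimal sketch. It assumes the gap is measured as the difference in exact-match accuracy between the original and less-biased questions; the abstract does not state the metric, so this choice, along with the names `PairedItem` and `shortcut_gap` and the toy records, is hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class PairedItem:
    """One surgical frame with predictions for both question variants.

    Hypothetical record layout; the paper's data format is not given here.
    """
    answer: str            # shared ground-truth answer for the frame
    pred_original: str     # prediction for the question with entity names
    pred_less_biased: str  # prediction for the question without entity names


def shortcut_gap(items):
    """Return (original accuracy, less-biased accuracy, gap).

    Assumes exact-match accuracy; a larger positive gap suggests heavier
    reliance on cues in the question text.
    """
    if not items:
        raise ValueError("need at least one paired item")
    n = len(items)
    acc_original = sum(i.pred_original == i.answer for i in items) / n
    acc_less_biased = sum(i.pred_less_biased == i.answer for i in items) / n
    return acc_original, acc_less_biased, acc_original - acc_less_biased


# Toy data for illustration only; these are not results from the paper.
toy_items = [
    PairedItem("grasp", "grasp", "cut"),
    PairedItem("cut", "cut", "cut"),
    PairedItem("retract", "retract", "grasp"),
    PairedItem("clip", "cut", "grasp"),
]
acc_orig, acc_lb, gap = shortcut_gap(toy_items)
```

Because both variants share the same frame and the same ground-truth answer, any difference between the two accuracies can be attributed to the question wording rather than to the visual content.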
