MedBLINK: Probing Basic Perception in Multimodal Language Models for Medicine

Mahtab Bigverdi; Wisdom Ikezogwo; Kevin Zhang; Hyewon Jeong; Mingyu Lu; Sungjae Cho; Linda Shapiro; Ranjay Krishna

arXiv:2508.02951·cs.AI·August 6, 2025

TL;DR

MedBLINK is a new benchmark that evaluates multimodal language models on basic perceptual tasks in medical imaging, revealing significant performance gaps that hinder clinical adoption.

Contribution

This paper introduces MedBLINK, a comprehensive benchmark for assessing perceptual abilities of medical multimodal models across multiple tasks and modalities.

Findings

01

Human accuracy is 96.4%, while top models reach only 65%.

02

Current models often fail basic perceptual tasks in medical imaging.

03

Results highlight the need for improved visual grounding in models.

Abstract

Multimodal language models (MLMs) show promise for clinical decision support and diagnostic reasoning, raising the prospect of end-to-end automated medical image interpretation. However, clinicians are highly selective in adopting AI tools; a model that makes errors on seemingly simple perception tasks, such as determining image orientation or identifying whether a CT scan is contrast-enhanced, is unlikely to be adopted for clinical tasks. We introduce MedBLINK, a benchmark designed to probe these models for such perceptual abilities. MedBLINK spans eight clinically meaningful tasks across multiple imaging modalities and anatomical regions, totaling 1,429 multiple-choice questions over 1,605 images. We evaluate 19 state-of-the-art MLMs, including general-purpose (GPT-4o, Claude 3.5 Sonnet) and domain-specific (Med-Flamingo, LLaVA-Med, RadFM) models. While human annotators achieve 96.4%…
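As an illustration of how a multiple-choice benchmark of this kind is commonly scored, the sketch below computes per-task and overall accuracy from a list of answered questions. It is not the authors' evaluation code: the record keys `task`, `answer`, and `prediction`, the two function names, and the sample task labels are all assumptions made for this example.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Return {task: accuracy} for a list of multiple-choice records.

    Each record is a dict with the hypothetical keys 'task', 'answer'
    (the correct option letter) and 'prediction' (the model's letter).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["task"]] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[rec["task"]] += 1
    return {task: correct[task] / total[task] for task in total}


def overall_accuracy(records):
    """Return the fraction of records answered correctly (0.0 if empty)."""
    if not records:
        return 0.0
    hits = sum(
        rec["prediction"].strip().upper() == rec["answer"].strip().upper()
        for rec in records
    )
    return hits / len(records)


# Toy records with made-up task labels, for illustration only.
sample = [
    {"task": "orientation", "answer": "A", "prediction": "a"},
    {"task": "orientation", "answer": "B", "prediction": "C"},
    {"task": "contrast", "answer": "B", "prediction": " B "},
]

print(per_task_accuracy(sample))
print(overall_accuracy(sample))
```

Reporting accuracy per task as well as overall keeps a weak task from being hidden by strong ones, which matters when the number of questions differs between tasks.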

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 5

Strengths

**Motivation**: AI models are increasingly critical in the medical field. The authors motivate the benchmark by arguing that clinicians will not trust a model that cannot solve simple perceptual tasks. By probing these “blink tasks”, MedBLINK assesses whether MLMs truly “see” the image or exploit superficial correlations. This focus on trustworthiness is reasonable, especially as MLMs are being considered for clinical decision support. **Task design**: The eight tasks are simple by design. They can be extrac

Weaknesses

1. **Ambiguity in Perception**: The core idea of the paper relies on a clean distinction between basic visual perception and complex reasoning, but some tasks go beyond simple perception. In my understanding, a basic perceptual task should be easy for any medically trained person to recognize. I agree those grounding-related tasks are simple for most people, like visual depth estimation, wave-based imaging depth estimation, histology structure, imaging orientation, and relative position. However

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- Extensive Benchmarking: The paper thoroughly evaluates multiple models on a diverse set of medical imaging tasks, offering a clear comparative analysis.
- In-Depth Analysis: The discussion goes beyond mere performance metrics, providing insightful observations into model behaviors, strengths, and weaknesses.
- Valuable Conclusions: The findings offer practical guidance and highlight important challenges in the application of foundation models to medical vision tasks.

Weaknesses

Clarification on Human Benchmark: The paper states that "human annotators achieve 96.4% accuracy." This metric is crucial as a performance ceiling, but several details require clarification to fully interpret this benchmark:
- Expertise Level: What was the expertise level of these annotators (e.g., board-certified radiologists, resident physicians, or medical students)? The performance gap between a model and a human can be interpreted very differently based on this.
- Ground Truth Adjudicatio

Reviewer 03 · Rating 2 · Confidence 3

Strengths

The paper is well-motivated, addressing basic perceptual competence. The benchmark is clearly structured, covering multiple imaging modalities and clinically relevant perceptual subtasks with expert validation. The experimental section is extensive, comparing a diverse set of 20 MLLMs and including human and CNN baselines.

Weaknesses

The main limitation lies in novelty. Similar perceptual or visual question-answering benchmarks already exist, such as MedFrameQA and MedTrinity-25M, and MedBLINK appears to extend these ideas into the medical domain without introducing fundamentally new methods or task formulations. It lacks open-ended evaluation, which is critical for real-life clinical use. Several tasks, such as determining whether an X-ray is upside down, seem disconnected from real clinical practice and may not provide mea

Code & Models

Datasets

MahtabBg/MedBLINK
dataset· 54 dl
