MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI

Rozain Shakeel, Abdul Rahman Mohammad Ali, Muneeb Mushtaq, Tausifa Jan Saleem, and Tajamul Ashraf

arXiv:2603.19993·cs.CV·March 23, 2026

TL;DR

MedSPOT introduces a realistic, workflow-aware benchmark for evaluating multimodal models' ability to perform sequential visual grounding in clinical GUI environments, emphasizing error propagation and diagnostic failure analysis.

Contribution

It presents MedSPOT, a novel benchmark with a sequential evaluation protocol and failure taxonomy tailored for medical GUI grounding tasks, addressing limitations of prior isolated, single-step benchmarks.

Findings

01. Benchmark captures complex medical workflows and interface hierarchies.

02. Sequential evaluation protocol measures error propagation.

03. Failure taxonomy enables systematic diagnosis of model errors.

Abstract

Despite the rapid progress of Multimodal Large Language Models (MLLMs), their ability to perform reliable visual grounding in high-stakes clinical software environments remains underexplored. Existing GUI benchmarks largely focus on isolated, single-step grounding queries, overlooking the sequential, workflow-driven reasoning required in real-world medical interfaces, where tasks evolve across interdependent steps and dynamic interface states. We introduce MedSPOT, a workflow-aware sequential grounding benchmark for clinical GUI environments. Unlike prior benchmarks that treat grounding as a standalone prediction task, MedSPOT models procedural interaction as a sequence of structured spatial decisions. The benchmark comprises 216 task-driven videos with 597 annotated keyframes, in which each task consists of 2 to 3 interdependent grounding steps within realistic medical workflows. This…
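
The abstract is truncated before it defines the evaluation protocol, so the following is only a minimal sketch of how a sequential grounding score could differ from a per-step one. It assumes that a step counts as correct when the predicted click point falls inside the target element's bounding box, and that a task is abandoned at its first miss. The names `Step`, `hit`, `steps_completed`, and `evaluate`, as well as the three metric names, are hypothetical and are not taken from the paper or the dataset.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class Step:
    target: Box       # assumed ground-truth box of the UI element for this step
    predicted: Point  # assumed model output: a single click location


def hit(point: Point, box: Box) -> bool:
    """Assumed correctness rule: the click lands inside the target box."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def steps_completed(task: List[Step]) -> int:
    """Count consecutive correct steps from the start, stopping at the first miss."""
    done = 0
    for step in task:
        if not hit(step.predicted, step.target):
            break  # later steps depend on this one, so they are not credited
        done += 1
    return done


def evaluate(tasks: List[List[Step]]) -> Dict[str, float]:
    """Compare sequential scoring with independent per-step scoring.

    Assumes `tasks` is non-empty and every task has at least one step.
    """
    total_steps = sum(len(task) for task in tasks)
    completed = [steps_completed(task) for task in tasks]
    independent_hits = sum(
        hit(step.predicted, step.target) for task in tasks for step in task
    )
    return {
        # fraction of tasks in which every step was grounded correctly
        "task_completion": sum(
            done == len(task) for done, task in zip(completed, tasks)
        ) / len(tasks),
        # steps credited only while the preceding steps were all correct
        "sequential_step_accuracy": sum(completed) / total_steps,
        # every step scored in isolation, as in single-step benchmarks
        "independent_step_accuracy": independent_hits / total_steps,
    }
```

Under these assumptions, the gap between `independent_step_accuracy` and `sequential_step_accuracy` gives a rough picture of error propagation: a correct prediction that follows an earlier miss counts toward the first metric but not the second.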

Code & Models

Datasets

Tajamul21/MedSPOT
dataset · 525 dl

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education