MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI
Rozain Shakeel, Abdul Rahman Mohammad Ali, Muneeb Mushtaq, Tausifa Jan Saleem, and Tajamul Ashraf

TL;DR
MedSPOT introduces a realistic, workflow-aware benchmark for evaluating multimodal models' ability to perform sequential visual grounding in clinical GUI environments, emphasizing error propagation and diagnostic failure analysis.
Contribution
The paper presents MedSPOT, a benchmark with a sequential evaluation protocol and a failure taxonomy tailored to medical GUI grounding tasks, addressing the limitations of prior benchmarks that evaluate isolated, single-step queries.
Findings
Benchmark captures complex medical workflows and interface hierarchies.
Sequential evaluation protocol measures error propagation.
Failure taxonomy enables systematic diagnosis of model errors.
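To make the idea of measuring error propagation concrete, here is a minimal sketch of one way a sequential grounding protocol could be scored. It is an illustration under stated assumptions, not the paper's confirmed protocol: the point-in-box hit criterion, the stop-at-first-miss rule, and the names `Step`, `first_failure`, `evaluate`, `task_completion_rate`, and `sequential_step_accuracy` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class Step:
    """One grounding step: annotated target box and the model's predicted click."""
    target: Box
    prediction: Point


def hit(point: Point, box: Box) -> bool:
    # Assumed criterion: a step is correct if the click lands inside the target box.
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def first_failure(steps: List[Step]) -> Optional[int]:
    """Index of the first missed step, or None if every step is hit."""
    for index, step in enumerate(steps):
        if not hit(step.prediction, step.target):
            return index
    return None


def evaluate(tasks: List[List[Step]]) -> Dict[str, float]:
    """Score tasks sequentially: steps after the first miss earn no credit."""
    if not tasks:
        raise ValueError("at least one task is required")
    completed = 0
    steps_credited = 0
    total_steps = 0
    for steps in tasks:
        total_steps += len(steps)
        failure = first_failure(steps)
        if failure is None:
            completed += 1
            steps_credited += len(steps)
        else:
            # Only the steps before the first miss count (error propagation).
            steps_credited += failure
    return {
        "task_completion_rate": completed / len(tasks),
        "sequential_step_accuracy": steps_credited / total_steps,
    }
```

Under this scoring, a model that misses the second of three steps earns credit only for the first step, even if its third prediction would have been correct in isolation. That gap between isolated and sequential scores is what a workflow-aware evaluation is meant to expose.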
Abstract
Despite the rapid progress of Multimodal Large Language Models (MLLMs), their ability to perform reliable visual grounding in high-stakes clinical software environments remains underexplored. Existing GUI benchmarks largely focus on isolated, single-step grounding queries, overlooking the sequential, workflow-driven reasoning required in real-world medical interfaces, where tasks evolve across interdependent steps and dynamic interface states. We introduce MedSPOT, a workflow-aware sequential grounding benchmark for clinical GUI environments. Unlike prior benchmarks that treat grounding as a standalone prediction task, MedSPOT models procedural interaction as a sequence of structured spatial decisions. The benchmark comprises 216 task-driven videos with 597 annotated keyframes, in which each task consists of 2 to 3 interdependent grounding steps within realistic medical workflows. This…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education
