MedOpenClaw and MedFlowBench: Auditing Medical Agents in Full-Study Workflows
Weixiang Shen, Chengzhi Shen, Yanzhu Hu, Che Liu, Junde Wu, Jiayuan Zhu, Xiao Han, Zongyue Li, Jingpei Wu, Min Xu, Daguang Xu, Yueming Jin, Benedikt Wiestler, Daniel Rueckert, Jiazhen Pan

TL;DR
This paper introduces MedFlowBench, a full-study benchmark, and MedOpenClaw, a controlled and replayable runtime, to evaluate whether medical imaging agents can work through complete studies and produce evidence that can be audited.
Contribution
It presents a new benchmark and runtime environment for assessing whether medical imaging agents can operate on complete studies and produce verifiable evidence.
Findings
Answer-only scoring overestimates model performance.
Adding image-analysis tools alone does not solve workflow challenges.
Models struggle with input selection, viewer management, and evidence verification.
Abstract
Medical imaging benchmarks often evaluate VLMs on pre-selected 2D images, slices, crops, or patches, making evaluation closer to visual recognition. Real clinical workflows impose a different burden: readers must search through complete studies, operate imaging software, navigate across slices and magnifications, and document visual evidence that can be audited. We argue that this evidence-producing workflow is a critical missing evaluation axis for medical imaging agents. To study it, we introduce MedFlowBench, a full-study benchmark for VLM agents, together with MedOpenClaw, a controlled and replayable runtime in which agents operate medical imaging viewers such as 3D Slicer and QuPath. In each episode, an agent inspects a complete radiology study or whole-slide pathology image, returns a task answer, and submits structured evidence, including key slices, coordinates, regions of…
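To make the difference between answer-only scoring and evidence-aware scoring concrete, here is a minimal Python sketch. It is not the MedFlowBench scoring code or the MedOpenClaw API: the class names, field names, tolerances, and scoring rule are all hypothetical, chosen only for illustration. The abstract is cut off before its list of evidence types ends, so the sketch models only the two that are named, key slices and coordinates.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Evidence:
    """Hypothetical evidence item: a key slice index and a point on that slice."""
    slice_index: int
    xy: Tuple[float, float]


@dataclass
class Submission:
    """Hypothetical episode output: a task answer plus supporting evidence."""
    answer: str
    evidence: List[Evidence] = field(default_factory=list)


def answer_only_score(sub: Submission, gold_answer: str) -> float:
    """Score 1.0 if the answer matches the reference, ignoring all evidence."""
    return 1.0 if sub.answer.strip().lower() == gold_answer.strip().lower() else 0.0


def evidence_hit(ev: Evidence, gold: Evidence,
                 slice_tol: int = 2, pixel_tol: float = 20.0) -> bool:
    """Assumed matching rule: close enough in slice index and in-plane distance."""
    dx = ev.xy[0] - gold.xy[0]
    dy = ev.xy[1] - gold.xy[1]
    in_plane = (dx * dx + dy * dy) ** 0.5
    return abs(ev.slice_index - gold.slice_index) <= slice_tol and in_plane <= pixel_tol


def audited_score(sub: Submission, gold_answer: str, gold_evidence: Evidence) -> float:
    """Score 1.0 only if the answer is right AND some evidence item matches."""
    if answer_only_score(sub, gold_answer) == 0.0:
        return 0.0
    return 1.0 if any(evidence_hit(ev, gold_evidence) for ev in sub.evidence) else 0.0


# Toy reference annotation and two toy submissions (all values invented).
gold = Evidence(slice_index=41, xy=(105.0, 118.0))
grounded = Submission("Glioma", [Evidence(40, (100.0, 120.0))])
lucky_guess = Submission("glioma", [Evidence(10, (300.0, 300.0))])

for name, sub in [("grounded", grounded), ("lucky_guess", lucky_guess)]:
    print(name, answer_only_score(sub, "glioma"), audited_score(sub, "glioma", gold))
```

In this toy setup both submissions receive full credit under answer-only scoring, but only the one whose evidence lands near the reference location keeps its credit once evidence is checked. This is one way the gap described in the findings could arise.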
