MedScope: Incentivizing "Think with Videos" for Clinical Reasoning via Coarse-to-Fine Tool Calling

Wenjie Li; Yujie Zhang; Haoran Sun; Xingqi He; Hongcheng Gao; Chenglong Ma; Ming Hu; Guankun Wang; Shiyi Yao; Renhao Yang; Hongliang Ren; Lei Wang; Junjun He; Yankai Jiang

arXiv:2602.13332 · cs.CV · February 17, 2026

TL;DR

MedScope introduces a tool-using, coarse-to-fine reasoning framework for clinical videos, significantly improving the accuracy and trustworthiness of medical video understanding by explicitly grounding predictions in temporally localized visual evidence.

Contribution

This work presents MedScope, a novel model that performs iterative, evidence-guided reasoning with tool calls for clinical videos, and introduces ClinVideoSuite for high-fidelity supervision.

Findings

01. Achieves state-of-the-art performance on clinical video benchmarks.

02. Effectively grounds predictions in temporally localized visual evidence.

03. Outperforms existing models in both in-domain and out-of-domain evaluations.

Abstract

Long-form clinical videos are central to visual evidence-based decision-making, with growing importance for applications such as surgical robotics and related settings. However, current multimodal large language models typically process videos with passive sampling or weakly grounded inspection, which limits their ability to iteratively locate, verify, and justify predictions with temporally targeted evidence. To close this gap, we propose MedScope, a tool-using clinical video reasoning model that performs coarse-to-fine evidence seeking over long-form procedures. By interleaving intermediate reasoning with targeted tool calls and verification on retrieved observations, MedScope produces more accurate and trustworthy predictions that are explicitly grounded in temporally localized visual evidence. To address the lack of high-fidelity supervision, we build ClinVideoSuite, an…
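The coarse-to-fine evidence seeking described in the abstract can be illustrated with a toy temporal search. This is only a sketch under stated assumptions, not the MedScope implementation: the function names `sample_times` and `coarse_to_fine_search`, the `score` callback (a stand-in for a model inspecting the frame that a tool call retrieves at a given timestamp), and all parameter values are hypothetical. Each round samples the current time window sparsely, keeps the most relevant timestamp, and narrows the window around it before sampling again.

```python
from typing import Callable, List, Tuple


def sample_times(start: float, end: float, n: int) -> List[float]:
    """Return n timestamps (seconds) at the centers of n equal bins in [start, end]."""
    step = (end - start) / n
    return [start + (i + 0.5) * step for i in range(n)]


def coarse_to_fine_search(
    score: Callable[[float], float],
    duration: float,
    n_samples: int = 8,
    rounds: int = 3,
) -> Tuple[float, float, List[Tuple[float, float, float]]]:
    """Narrow a time window around the highest-scoring timestamp.

    `score` is a hypothetical relevance function: in a real system it would
    wrap a tool call that fetches a frame or clip and asks a model how
    relevant it is to the question. Here it is just a Python callable.
    Returns the final window and a trace of (start, end, best_time) per round.
    """
    start, end = 0.0, duration
    trace: List[Tuple[float, float, float]] = []
    for _ in range(rounds):
        times = sample_times(start, end, n_samples)
        best = max(times, key=score)  # most relevant sampled timestamp
        trace.append((start, end, best))
        step = (end - start) / n_samples
        # Zoom in: keep one bin width on each side of the best timestamp.
        start, end = max(0.0, best - step), min(duration, best + step)
    return start, end, trace


if __name__ == "__main__":
    # Assumed toy setup: the relevant event sits at t = 437 s in a one-hour video.
    target = 437.0
    lo, hi, trace = coarse_to_fine_search(lambda t: -abs(t - target), 3600.0)
    for round_start, round_end, best in trace:
        print(f"window [{round_start:.1f}, {round_end:.1f}] -> best {best:.1f}")
    print(f"final window: [{lo:.1f}, {hi:.1f}]")
```

With a single-peaked score like the one in the usage example, each round shrinks the window by a factor of `n_samples / 2` while keeping the target inside it. The recorded trace plays the role of the temporally localized evidence that a prediction could cite.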

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning