Incentivizing Tool-augmented Thinking with Images for Medical Image Analysis
Yankai Jiang, Yujie Zhang, Peng Zhang, Yichen Li, Jintai Chen, Xiaoming Shi, Shihui Zhen

TL;DR
Ophiuchus is a versatile framework that enhances medical image analysis by enabling multimodal reasoning with dynamic visual grounding and external tools, leading to improved performance on various benchmarks.
Contribution
It introduces a three-stage training strategy for a tool-augmented medical reasoning model that integrates visual evidence selection, reflection, and reinforcement learning.
Findings
Outperforms state-of-the-art methods on multiple medical benchmarks.
Effectively integrates external tools with inherent model perception.
Demonstrates improved diagnostic reasoning and accuracy.
Abstract
Recent reasoning-based medical MLLMs have made progress in generating step-by-step textual reasoning chains. However, they still struggle with complex tasks that require dynamic, iterative focusing on fine-grained visual regions to achieve precise grounding and diagnosis. We introduce Ophiuchus, a versatile, tool-augmented framework that equips an MLLM to (i) decide when additional visual evidence is needed, (ii) determine where to probe and ground within the medical image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved, multimodal chain of thought. In contrast to prior approaches limited by the performance ceiling of specialized tools, Ophiuchus integrates the model's inherent grounding and perception capabilities with external tools, thereby fostering higher-level reasoning. The core of our method is a three-stage training strategy:…
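The when / where / weave-back loop named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `Step`, `crop`, `propose_region`, and `reason` are hypothetical, the image is a plain nested list, and a simple brightness threshold stands in for the model's learned decision about whether and where to look.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


@dataclass
class Step:
    kind: str                    # "text" or "image"
    content: Union[str, Image]


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def propose_region(image: Image, threshold: int) -> Optional[Box]:
    """Hypothetical stand-in for the 'when' and 'where' decisions.

    Returns the bounding box of all pixels brighter than threshold,
    or None when no pixel qualifies (no extra evidence needed).
    """
    coords = [(r, c)
              for r, row in enumerate(image)
              for c, value in enumerate(row)
              if value > threshold]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows) + 1, max(cols) + 1)


def reason(image: Image, threshold: int = 200, max_steps: int = 3) -> List[Step]:
    """Build an interleaved text/image trace by zooming until nothing changes."""
    trace = [Step("text", "Inspect the full image.")]
    current = image
    for _ in range(max_steps):
        box = propose_region(current, threshold)
        if box is None:
            break                # 'when': no further visual evidence needed
        zoomed = crop(current, box)
        if zoomed == current:
            break                # the crop adds nothing new
        current = zoomed
        trace.append(Step("text", f"Zoom into region {box}."))
        trace.append(Step("image", current))  # weave the sub-image back in
    trace.append(Step("text", "Answer using the collected evidence."))
    return trace
```

In a real system the threshold rule would be replaced by the model's own grounding output or an external tool call; the sketch only shows how text steps and cropped sub-images can alternate in a single trace.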
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper is well-written. 2. The presented method is clear and well-organized. 3. The experiments are extensive.
1. The novelty of the idea is limited. The paper presents several tool-augmented multimodal medical agents but misses many existing general and text-based medical agents augmented with tools, e.g., [1][2]. Meanwhile, the number of incorporated tools (only three) is limited compared to existing works. [1] AgentMD: Empowering language agents for risk prediction with large-scale clinical tool learning. arXiv preprint, 2024. [2] RiskAgent: Autonomous Medical AI Copilot for Generalist Risk Predictio
1. This paper proposes a tool-augmented interleaved vision-language reasoning paradigm that enables medical MLLMs to adaptively process localized visual evidence. 2. The proposed Ophiuchus achieves SOTA performance across multiple medical benchmarks.
1. The authors introduce a tool-augmented 'thinking with images' interleaved vision-language reasoning paradigm and a multi-stage training framework. Since these concepts have been widely studied in general-domain MLLMs, could the authors elaborate on the key differences between their approach and existing methods? Are there specific adaptations or modifications tailored to the medical domain? 2. The authors use SAM2 as the image analysis tool. Since SAM2 is not specifically trained on medical d
### **Originality:**
The paper introduces a novel framework that creatively combines several existing ideas (tool use, CoT, RL) into a cohesive and new formulation for medical image analysis. The specific three-stage training strategy, particularly the Self-Reflection Fine-Tuning and the fine-grained ATRL reward design, demonstrates significant originality in execution.

### **Quality:**
The technical quality is high. The experimental design is thorough, the benchmarks are diverse and challengin
Here are the weaknesses in order of importance:

### **Isolating the Effect of Self-Reflection:**
While Table 2 validates the contribution of the full three-stage pipeline, the marginal gain from the self-reflection stage (M_cold+reflect vs. M_cold) appears modest. It is unclear whether this gain stems from the proposed self-reflection mechanism itself or simply from the additional two epochs of fine-tuning on a curated subset of data. A stronger ablation would be to compare M_cold+reflect (trai
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning
