Look-Closer-Then-Diagnose: Confidence-Aware Ultrasound VQA via Active Zooming

Yue Zhou; Erxuan Wu; Yikang Sun; Hongjoo Lee; Yuan Bi; Huixiong Xu; and Zhongliang Jiang

arXiv:2605.21652 · cs.CV · May 22, 2026

TL;DR

This paper introduces a confidence-aware ultrasound VQA framework that mimics sonographers' focus on lesions through active zooming and uncertainty estimation, improving lesion localization accuracy.

Contribution

It proposes a structured Zoom-then-Diagnose paradigm combined with an uncertainty-aware reward within GRPO to enhance model confidence and interpretability in ultrasound diagnosis.
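As background for the GRPO component, here is a minimal sketch of the group-relative advantage that GRPO is generally built on: each sampled response's reward is normalized against the other responses drawn for the same prompt. The function name, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions made for illustration. The paper's uncertainty-aware reward is not specified on this page, so the reward values below are placeholders rather than the authors' formulation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    are scored relative to their siblings rather than on an absolute scale.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scalar rewards for four sampled answers to one question.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one. When every response in a group gets the same reward, all advantages are zero and that group contributes no learning signal.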

Findings

01. Lesion localization improves by 39.3% across multiple datasets.

02. The model learns to actively focus on lesions before diagnosis.

03. Uncertainty estimation is incorporated to handle ambiguous cases.
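The "focus on lesions before diagnosis" behavior listed above can be pictured as a crop step that runs between localization and answering. The sketch below is only an illustration under stated assumptions: the image is a plain list of pixel rows, the box format `(x0, y0, x1, y1)` and the `margin` parameter are hypothetical, and this page does not describe how the paper actually implements zooming.

```python
def zoom_crop(image, box, margin=0.1):
    """Crop an image (list of pixel rows) around a predicted lesion box.

    box is (x0, y0, x1, y1) in pixel coordinates, with x1 and y1 exclusive.
    The box is expanded by `margin` (a fraction of its width and height) on
    every side and clamped to the image bounds before cropping.
    """
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    pad_x = int((x1 - x0) * margin)
    pad_y = int((y1 - y0) * margin)
    left = max(0, x0 - pad_x)
    top = max(0, y0 - pad_y)
    right = min(width, x1 + pad_x)
    bottom = min(height, y1 + pad_y)
    return [row[left:right] for row in image[top:bottom]]


# Hypothetical 10x10 "image" whose pixel value encodes its position.
image = [[y * 10 + x for x in range(10)] for y in range(10)]
crop = zoom_crop(image, box=(2, 3, 6, 7), margin=0.5)
```

In a Zoom-then-Diagnose style pipeline, the cropped region would be passed back to the model for the diagnostic answer. Keeping a margin around the box preserves some surrounding tissue for context.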

Abstract

Vision-Language Models (VLMs) have significantly advanced medical visual question answering, yet their performance in ultrasound remains suboptimal. In clinical practice, sonographers explicitly focus on lesion regions to formulate reports, though diagnostic interpretations sometimes vary due to inherent subjectivity. However, existing VLMs are not explicitly structured to interactively zoom into lesions prior to diagnosis; moreover, they typically treat annotations as unbiased ground truths, failing to account for their inherent subjectivity and ambiguity. In this paper, we propose a framework specifically designed to consider the sonographer's cognitive workflow. We first introduce a structured Zoom-then-Diagnose paradigm, which replicates the interactive search process to enable lesion-focused reasoning. Furthermore, within the Group Relative Policy Optimization (GRPO) framework, we…
