Adapting Vision-Language Foundation Model for Next Generation Medical Ultrasound Image Analysis

Jingguo Qu; Xinyang Han; Jia Ai; Juan Wu; Tong Zhao; Tonghuan Xiao; Sheng Ning; Yuqi Yang; Jing Qin; Ann Dorothy King; Winnie Chiu-Wing Chu; Jing Cai; Michael Tin-Cheung Ying

arXiv:2506.08849·cs.CV·May 5, 2026

TL;DR

This paper introduces Hybrid Tuning (HT), a parameter-efficient strategy for adapting CLIP-based vision-language models to ultrasound analysis. HT targets ultrasound-specific noise and artifacts, and the authors report superior performance across six multi-center datasets.

Contribution

A hybrid tuning method that preserves pre-trained knowledge by freezing the visual backbone, and adds a lightweight adapter with ultrasound-specific Frequency Filtering and Noise Estimation modules.

Findings

1. Significantly outperforms existing adapters and VLFMs in segmentation and classification.

2. Exhibits high data efficiency in few-shot learning scenarios.

3. Shows robust generalization across different datasets.

Abstract

Vision-Language Foundation Models (VLFMs) exhibit remarkable generalization, yet their direct application to medical ultrasound is severely hindered by a profound modality gap. The unique acoustic physics of ultrasound, characterized by speckle noise, shadowing, and heterogeneous textures, often degrades the performance of off-the-shelf VLFMs. To bridge this gap, we propose a novel Hybrid Tuning (HT) strategy for the parameter-efficient adaptation of CLIP-based models to ultrasound analysis. Instead of updating the pre-trained weights, HT freezes the visual backbone and integrates a specialized lightweight adapter. This adapter features a Frequency Filtering module to suppress domain-specific periodic artifacts and a Noise Estimation module to dynamically calibrate feature representations. Extensive evaluations across six multi-center datasets demonstrate that our HT-enhanced models…
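
To make the adapter idea in the abstract more concrete, here is a minimal NumPy sketch, not the authors' implementation. The abstract does not specify how the Frequency Filtering or Noise Estimation modules are built, so everything below is an assumption for illustration. A fixed low-pass Fourier mask stands in for the frequency filter, and the standard deviation of the removed high-frequency residual stands in for the noise estimate. The frozen backbone is represented only by its output array. The names `low_pass_mask`, `frequency_filter`, `noise_gate`, and `hybrid_adapter` are hypothetical.

```python
import numpy as np


def low_pass_mask(h, w, keep_ratio):
    """Binary (h, w) mask keeping spatial frequencies below keep_ratio * Nyquist.

    Assumption: a fixed radial low-pass mask; a trained adapter would more
    plausibly learn its filter.
    """
    fy = np.fft.fftfreq(h)[:, None]  # cycles per sample, in [-0.5, 0.5)
    fx = np.fft.fftfreq(w)[None, :]
    return (np.sqrt(fy ** 2 + fx ** 2) <= 0.5 * keep_ratio).astype(np.float64)


def frequency_filter(feat, mask):
    """Filter a (C, H, W) feature map by masking its 2D Fourier spectrum."""
    spec = np.fft.fft2(feat, axes=(-2, -1))
    return np.fft.ifft2(spec * mask, axes=(-2, -1)).real


def noise_gate(feat, filtered):
    """Per-channel gate in [0, 1) derived from the removed residual.

    Assumption: the residual's standard deviation is used as a crude noise
    estimate; more estimated noise gives a gate closer to 1.
    """
    sigma = (feat - filtered).std(axis=(-2, -1))  # shape (C,)
    return sigma / (1.0 + sigma)


def hybrid_adapter(frozen_feat, mask, scale=1.0):
    """Residual adapter applied to features from a frozen backbone.

    The backbone features are never modified in place; the adapter only
    adds a gated correction that pulls them toward the filtered version.
    """
    filtered = frequency_filter(frozen_feat, mask)
    gate = noise_gate(frozen_feat, filtered)
    return frozen_feat + scale * gate[:, None, None] * (filtered - frozen_feat)
```

In this sketch, channels whose content is mostly high-frequency (and therefore treated as noisy under the stated assumption) are pulled more strongly toward their filtered version, while smooth channels pass through almost unchanged. Setting `scale=0.0` recovers the frozen features exactly, which mirrors the idea of preserving pre-trained knowledge.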

Code & Models

Repositories

GitHub: jinggqu/NextGen-UIA
