Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding
Jiayun Jin, Haolong Chai, Xueying Huang, Xiaoqing Guo, Zengwei Zheng, Zhan Zhou, Junmei Wang, Xinyu Wang, Jie Liu, Binbin Zhou

TL;DR
Ultrasound-CLIP is a semantic-aware contrastive pre-training method for ultrasound image-text understanding. It leverages US-365K, a dataset of 365k ultrasound image-text pairs, and hierarchical taxonomies to improve diagnostic classification and retrieval.
Contribution
The paper presents Ultrasound-CLIP, a novel contrastive learning framework specifically designed for ultrasound data, incorporating semantic soft labels and structured reasoning over lesion attributes.
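To make the idea of semantic soft labels concrete, here is a minimal sketch of a CLIP-style contrastive loss whose targets are soft rather than one-hot. It is an illustration under stated assumptions, not the paper's formulation: the function names, the `label_sim` matrix (a hypothetical pairwise semantic similarity between samples in a batch), and both temperature values are invented for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_matrix(a, b):
    """Pairwise cosine similarity between two lists of vectors."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))
    return [[sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v)) for v in b]
            for u in a]


def cross_entropy(targets, logits):
    """Mean cross-entropy between soft target rows and logit rows."""
    total = 0.0
    for t_row, l_row in zip(targets, logits):
        m = max(l_row)
        log_z = m + math.log(sum(math.exp(x - m) for x in l_row))
        total -= sum(t * (x - log_z) for t, x in zip(t_row, l_row))
    return total / len(logits)


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def soft_label_clip_loss(img, txt, label_sim, logit_temp=0.07, target_temp=0.1):
    """Symmetric image-text contrastive loss with soft targets.

    label_sim[i][j] is an assumed semantic similarity between sample i and
    sample j (hypothetical input). A one-hot identity matrix recovers
    targets close to the standard CLIP objective.
    """
    logits = [[s / logit_temp for s in row] for row in cosine_matrix(img, txt)]
    # Soft targets: row-wise softmax over the semantic similarities.
    targets_i2t = [softmax([s / target_temp for s in row]) for row in label_sim]
    targets_t2i = [softmax([s / target_temp for s in row])
                   for row in transpose(label_sim)]
    loss_i2t = cross_entropy(targets_i2t, logits)
    loss_t2i = cross_entropy(targets_t2i, transpose(logits))
    return 0.5 * (loss_i2t + loss_t2i)
```

With soft targets, a pair of samples that share a diagnosis or anatomical category is penalized less for being close in embedding space than a pair with nothing in common, which is the general motivation for replacing one-hot targets in this kind of objective.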
Findings
Achieves state-of-the-art performance on classification and retrieval benchmarks.
Demonstrates strong generalization in zero-shot, linear probing, and fine-tuning tasks.
Constructs US-365K, a large-scale ultrasound image-text dataset with hierarchical taxonomies.
Abstract
Ultrasound imaging is widely used in clinical diagnostics due to its real-time capability and radiation-free nature. However, existing vision-language pre-training models, such as CLIP, are primarily designed for other modalities and are difficult to apply directly to ultrasound data, which exhibit heterogeneous anatomical structures and diverse diagnostic attributes. To bridge this gap, we construct US-365K, a large-scale ultrasound image-text dataset containing 365k paired samples across 52 anatomical categories. We establish the Ultrasonographic Diagnostic Taxonomy (UDT), which contains two hierarchical knowledge frameworks. The Ultrasonographic Hierarchical Anatomical Taxonomy standardizes anatomical organization, and the Ultrasonographic Diagnostic Attribute Framework formalizes nine diagnostic dimensions, including body system, organ, diagnosis, shape, margins, echogenicity, internal…
