Ultrasound Vision-Language Alignment via Contrastive Learning
Zhuoyang Lyu, Yiyang Zhang, Tongxin Wang, Ruirui Lan

TL;DR
This paper introduces EchoCare-CLIP, a contrastive learning framework aligning ultrasound images with clinical text, improving cross-modal alignment and transferability in medical imaging tasks.
Contribution
The work presents an ultrasound vision-language model trained on a multi-organ corpus of over 16K image-text pairs (breast, liver, lung, and thyroid) and reports improved alignment and transfer performance over the OpenAI CLIP and BiomedCLIP baselines.
Findings
The best model achieved a paired alignment score of 0.682.
Partial fine-tuning improved zero-shot classification accuracy.
Template-based captions matched or outperformed LLM-generated captions.
Abstract
Ultrasound foundation models have achieved strong performance on structured prediction tasks but remain exclusively vision-based, limiting zero-shot and few-shot transfer to novel tasks where task-specific annotation is scarce. We address this gap with EchoCare-CLIP, a CLIP-style dual-encoder contrastive framework that aligns ultrasound images with clinical text in a shared embedding space. We curate a multi-organ corpus of over 16K image-text pairs spanning breast, liver, lung, and thyroid, with over 78% of captions derived from expert-annotated reports, and complement the remainder with a three-tier template-based and LLM-based caption generation pipeline. We evaluate model configurations spanning two text encoder families (CLIP, BioClinicalBERT) and two caption strategies (template-based, LLM-generated) against OpenAI CLIP and BiomedCLIP baselines. Our trained models consistently…
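To make the "CLIP-style dual-encoder contrastive framework" concrete, here is a minimal NumPy sketch of the symmetric contrastive objective commonly used in CLIP-style training, plus a zero-shot classification step in the shared embedding space. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature value of 0.07, and the use of precomputed embeddings in place of real image and text encoders are all hypothetical choices made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an image-text similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a matched pair;
    every other row in the batch acts as a negative. The temperature is an
    assumed value, not one taken from the paper.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature           # shape (batch, batch)
    diag = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


def zero_shot_predict(image_emb, class_text_emb):
    """Assign each image to the class whose text embedding is most similar."""
    sims = l2_normalize(image_emb) @ l2_normalize(class_text_emb).T
    return sims.argmax(axis=1)
```

In this sketch, zero-shot transfer needs no task-specific labels: each candidate class is written as a text prompt, embedded with the text encoder, and compared against the image embedding by cosine similarity.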
