Ultrasound Vision-Language Alignment via Contrastive Learning

Zhuoyang Lyu; Yiyang Zhang; Tongxin Wang; Ruirui Lan

arXiv:2605.02126 · cs.CV · May 5, 2026


TL;DR

This paper introduces EchoCare-CLIP, a CLIP-style contrastive learning framework that aligns ultrasound images with clinical text in a shared embedding space, with the aim of improving cross-modal alignment and zero-shot and few-shot transfer.

Contribution

The work presents an ultrasound vision-language model trained on a multi-organ corpus of over 16K image-text pairs (breast, liver, lung, and thyroid) and reports improved alignment and transfer performance over the OpenAI CLIP and BiomedCLIP baselines.

Findings

01

The best model achieved a paired alignment score of 0.682.

02

Partial fine-tuning improved zero-shot classification accuracy.

03

Template-based captions matched or outperformed LLM-generated captions.
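
Finding 02 refers to zero-shot classification. In CLIP-style models this is commonly done by embedding one text prompt per class and assigning the image to the class whose prompt embedding is most similar. The sketch below illustrates that general idea only; the function names, the toy two-dimensional embeddings, and the "benign"/"malignant" labels are hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, prompt_embs):
    """Return (best_label, scores) for one image embedding.

    prompt_embs maps a class label to the text embedding of its prompt.
    Both embeddings are assumed to come from encoders that share one space.
    """
    scores = {label: cosine(image_emb, emb) for label, emb in prompt_embs.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores


if __name__ == "__main__":
    # Toy stand-ins for encoder outputs; real embeddings would be much larger.
    prompts = {"benign": [1.0, 0.0], "malignant": [0.0, 1.0]}
    label, scores = zero_shot_classify([0.2, 0.9], prompts)
    print(label, scores)
```

No labeled training examples for the target classes are needed at inference time, which is why this setup is attractive where task-specific annotation is scarce.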

Abstract

Ultrasound foundation models have achieved strong performance on structured prediction tasks but remain exclusively vision-based, limiting zero-shot and few-shot transfer to novel tasks where task-specific annotation is scarce. We address this gap with EchoCare-CLIP, a CLIP-style dual-encoder contrastive framework that aligns ultrasound images with clinical text in a shared embedding space. We curate a multi-organ corpus of over 16K image-text pairs spanning breast, liver, lung, and thyroid, with over 78% of captions derived from expert-annotated reports, and complement the remainder with a three-tier template-based and LLM-based caption generation pipeline. We evaluate model configurations spanning two text encoder families (CLIP, BioClinicalBERT) and two caption strategies (template-based, LLM-generated) against OpenAI CLIP and BiomedCLIP baselines. Our trained models consistently…
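
The abstract describes a CLIP-style dual-encoder contrastive framework. As a rough illustration of what such an objective usually looks like, the sketch below computes the symmetric contrastive (InfoNCE) loss used by the original CLIP over a batch of paired image and text embeddings. It is written in plain Python for readability; the function names and the temperature of 0.07 are assumptions, and the paper's actual loss, encoders, and hyperparameters may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; image_embs[i] and text_embs[i] form a pair."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # logits[i][j]: temperature-scaled cosine similarity of image i and caption j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text: each row should concentrate probability on its own caption.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image: same computation over the columns of the logit matrix.
    columns = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (i2t + t2i)


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for encoder outputs.
    images = [[1.0, 0.0], [0.0, 1.0]]
    captions = [[0.9, 0.1], [0.2, 0.8]]
    print(clip_contrastive_loss(images, captions))
```

Minimizing this loss pulls each image toward its own caption and pushes it away from the other captions in the batch, which is what produces the shared embedding space.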
