Grounding DINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models
Hamza Rasaee, Taha Koleilat, Hassan Rivaz

TL;DR
This paper introduces a prompt-driven vision-language model combining Grounding DINO and SAM2, fine-tuned with LoRA on ultrasound datasets, to improve multi-organ segmentation accuracy and generalization across diverse ultrasound images.
Contribution
The study presents a novel integration of Grounding DINO and SAM2 with LoRA tuning for ultrasound segmentation, demonstrating superior performance and robustness without extensive organ-specific data.
Findings
Outperforms state-of-the-art segmentation methods on most datasets
Maintains strong performance on unseen datasets without additional fine-tuning
Reduces dependence on large annotated datasets for ultrasound segmentation
Abstract
Accurate and generalizable object segmentation in ultrasound imaging remains a significant challenge due to anatomical variability, diverse imaging protocols, and limited annotated data. In this study, we propose a prompt-driven vision-language model (VLM) that integrates Grounding DINO with SAM2 (Segment Anything Model 2) to enable object segmentation across multiple organs in ultrasound images. We used 18 public ultrasound datasets covering the breast, thyroid, liver, prostate, kidney, and paraspinal muscle. Fifteen of these were used to fine-tune and validate Grounding DINO, which was adapted to the ultrasound domain with Low-Rank Adaptation (LoRA); the remaining 3 were held out entirely for testing, to evaluate performance on unseen distributions. Comprehensive experiments demonstrate that our approach outperforms state-of-the-art segmentation methods, including UniverSeg, MedSAM,…
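The abstract names Low-Rank Adaptation (LoRA) as the fine-tuning method but does not spell out the mechanism. The sketch below illustrates the general LoRA idea on a single linear layer in NumPy: the pretrained weight stays frozen and only a low-rank update, the product of two small matrices, is trained. It is not the authors' implementation; the class name `LoRALinear`, the rank, the scaling factor `alpha`, and the layer size are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


class LoRALinear:
    """Toy linear layer with a LoRA update (illustrative, not the paper's code)."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight  # pretrained weight, kept frozen
        # Low-rank factors: A starts small and random, B starts at zero,
        # so the adapted layer initially matches the pretrained one.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))
        self.scale = alpha / rank  # assumed scaling convention

    def forward(self, x):
        # x has shape (batch, in_dim).
        frozen = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return frozen + self.scale * update

    def trainable_params(self):
        # Only A and B would receive gradients during fine-tuning.
        return self.A.size + self.B.size


rng = np.random.default_rng(1)
W = rng.normal(size=(256, 256))  # hypothetical pretrained projection
layer = LoRALinear(W, rank=4)
x = rng.normal(size=(2, 256))
y = layer.forward(x)

print("trainable:", layer.trainable_params(), "frozen:", W.size)
```

With these assumed sizes the adapter trains 2,048 parameters against 65,536 frozen ones, which is the kind of saving that makes adapting a large detector to a new domain practical with limited annotated data.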
Taxonomy
Methods: Vision Transformer · self-DIstillation with NO labels
