Grounding DINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models

Hamza Rasaee; Taha Koleilat; Hassan Rivaz

arXiv:2506.23903·cs.CV·September 10, 2025

TL;DR

This paper introduces a prompt-driven vision-language model combining Grounding DINO and SAM2, fine-tuned with LoRA on ultrasound datasets, to improve multi-organ segmentation accuracy and generalization across diverse ultrasound images.

Contribution

The study presents a novel integration of Grounding DINO and SAM2 with LoRA tuning for ultrasound segmentation, demonstrating superior performance and robustness without extensive organ-specific data.
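The two-stage flow described above (a text prompt yields boxes, and each box then prompts a segmenter) can be sketched as follows. This is only an illustration of the control flow under stated assumptions: `toy_detector` and `toy_segmenter` are hypothetical stand-ins based on brightness thresholding, not the Grounding DINO or SAM2 APIs, and the prompt string is passed through without being interpreted.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]
Box = Tuple[int, int, int, int]  # (row_min, col_min, row_max, col_max), inclusive
Mask = List[List[int]]


def toy_detector(image: Image, prompt: str, threshold: float = 0.5) -> List[Box]:
    """Hypothetical stand-in for a text-prompted detector.

    It ignores the prompt's meaning and returns one box around all pixels
    brighter than `threshold`, or no box if there are none.
    """
    coords = [(r, c) for r, row in enumerate(image)
              for c, v in enumerate(row) if v > threshold]
    if not coords:
        return []
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return [(min(rows), min(cols), max(rows), max(cols))]


def toy_segmenter(image: Image, box: Box, threshold: float = 0.5) -> Mask:
    """Hypothetical stand-in for a box-prompted segmenter: thresholds inside the box."""
    r0, c0, r1, c1 = box
    return [[1 if (r0 <= r <= r1 and c0 <= c <= c1 and v > threshold) else 0
             for c, v in enumerate(row)]
            for r, row in enumerate(image)]


def segment_from_text(
    image: Image,
    prompt: str,
    detector: Callable[[Image, str], List[Box]],
    segmenter: Callable[[Image, Box], Mask],
) -> Mask:
    """Run the detector on the prompt, then union the masks from every box."""
    mask = [[0] * len(row) for row in image]
    for box in detector(image, prompt):
        box_mask = segmenter(image, box)
        mask = [[a | b for a, b in zip(mr, br)] for mr, br in zip(mask, box_mask)]
    return mask


# A 4x4 "image" with a bright 2x2 region in the middle.
image = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.8, 0.0],
    [0.0, 0.7, 0.9, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
mask = segment_from_text(image, "thyroid nodule", toy_detector, toy_segmenter)
```

Passing the detector and segmenter as callables keeps the two stages independent, so either stand-in could be swapped for a real model wrapper without changing `segment_from_text`.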

Findings

01. Outperforms state-of-the-art segmentation methods on most datasets

02. Maintains strong performance on unseen datasets without additional fine-tuning

03. Reduces dependence on large annotated datasets for ultrasound segmentation

Abstract

Accurate and generalizable object segmentation in ultrasound imaging remains a significant challenge due to anatomical variability, diverse imaging protocols, and limited annotated data. In this study, we propose a prompt-driven vision-language model (VLM) that integrates Grounding DINO with SAM2 (Segment Anything Model 2) to enable object segmentation across multiple organs in ultrasound. A total of 18 public ultrasound datasets, encompassing the breast, thyroid, liver, prostate, kidney, and paraspinal muscle, were utilized. Of these, 15 were used to fine-tune and validate Grounding DINO, adapted to the ultrasound domain with Low-Rank Adaptation (LoRA), and 3 were held out entirely for testing to evaluate performance on unseen distributions. Comprehensive experiments demonstrate that our approach outperforms state-of-the-art segmentation methods, including UniverSeg, MedSAM,…
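As background on the LoRA step mentioned in the abstract, here is a minimal pure-Python sketch of the general low-rank update, W' = W + (alpha / r) · B A, where W stays frozen and only the small matrices A and B are trained. The class name `LoRALinear`, the initialization scale, and the layer size are assumptions made for illustration; the paper's actual ranks, target layers, and hyperparameters are not given in this excerpt.

```python
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable update (alpha / r) * B @ A.

    Illustrative only: names and initialization are assumptions, not the
    paper's configuration.
    """

    def __init__(self, weight: Matrix, rank: int, alpha: float = 1.0, seed: int = 0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight            # frozen pretrained weight, never updated
        self.scale = alpha / rank       # standard LoRA scaling factor
        # A (r x d_in) starts with small random values; B (d_out x r) starts at
        # zero, so the adapted layer initially matches the frozen one exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def effective_weight(self) -> Matrix:
        delta = matmul(self.B, self.A)  # d_out x d_in, rank at most r
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x: List[float]) -> List[float]:
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def trainable_params(self) -> int:
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# A 4x4 identity layer adapted with rank 1: 8 trainable values instead of 16.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(identity, rank=1, alpha=2.0)
y_before = layer.forward([1.0, 2.0, 3.0, 4.0])
```

Because B starts at zero, fine-tuning begins from the pretrained behaviour, and the number of trainable values grows with r · (d_in + d_out) rather than d_in · d_out.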

Taxonomy

Methods: Vision Transformer · self-DIstillation with NO labels