LoGSAM: Parameter-Efficient Cross-Modal Grounding for MRI Segmentation
Mohammad Robaitul Islam Bhuiyan, Sheethal Bhat, Melika Qahqaie, Tri-Thien Nguyen, Paula Andrea Perez-Toro, Tomas Arias-Vergara, Andreas Maier

TL;DR
LoGSAM introduces a parameter-efficient, speech-guided MRI tumor segmentation framework that leverages pretrained foundation models with minimal updates, achieving state-of-the-art accuracy and high case-level reliability.
Contribution
It presents a novel modular pipeline combining speech transcription, NLP, vision-language detection, and segmentation models with minimal parameter updates for MRI tumor segmentation.
Findings
Achieved an 80.32% Dice score on the BRISC 2025 dataset.
Attained 91.7% case-level accuracy on unseen MRI scans.
Enabled efficient domain adaptation with only 5% of model parameters updated.
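For readers unfamiliar with the metric in the first finding, the Dice score measures the overlap between a predicted mask and a reference mask as 2|A ∩ B| / (|A| + |B|). The sketch below is a minimal illustration, assuming binary masks flattened to 0/1 sequences. The function name `dice_score` and the empty-mask convention are assumptions for illustration; this summary does not say how the authors compute or aggregate the score.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels marked as tumor in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total tumor pixels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks empty: treated as perfect agreement (an assumed convention).
        return 1.0
    return 2.0 * intersection / total
```

With a prediction of two tumor pixels and a reference of one, where one pixel overlaps, the score is 2·1 / (2 + 1) ≈ 0.667.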
Abstract
Precise localization and delineation of brain tumors using Magnetic Resonance Imaging (MRI) are essential for planning therapy and guiding surgical decisions. However, most existing approaches rely on task-specific supervised models and are constrained by the limited availability of annotated data. To address this, we propose LoGSAM, a parameter-efficient, detection-driven framework that transforms radiologist dictation into text prompts for foundation-model-based localization and segmentation. Radiologist speech is first transcribed and translated using a pretrained Whisper ASR model, followed by negation-aware clinical NLP to extract tumor-specific textual prompts. These prompts guide text-conditioned tumor localization via a LoRA-adapted vision-language detection model, Grounding DINO (GDINO). The LoRA adaptation updates only 5% of the model parameters, thereby enabling…
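To make the negation-aware prompt extraction step concrete, here is a rough rule-based sketch in the spirit of NegEx-style clause matching. It is not the authors' NLP component, which this summary does not describe in detail. The vocabularies `TUMOR_TERMS` and `NEGATION_CUES` and the function `extract_prompts` are hypothetical names chosen for illustration.

```python
import re
from typing import List

# Illustrative vocabularies; both lists are assumptions, not taken from the paper.
TUMOR_TERMS = ["glioma", "meningioma", "pituitary tumor"]
NEGATION_CUES = ["no", "not", "without", "no evidence of", "negative for"]


def extract_prompts(transcript: str) -> List[str]:
    """Return tumor terms mentioned without a preceding negation cue in the same clause."""
    prompts: List[str] = []
    # Split into clauses on punctuation and the conjunction "but",
    # so a negation only affects terms in its own clause.
    clauses = re.split(r"[.;,]|\bbut\b", transcript.lower())
    for clause in clauses:
        for term in TUMOR_TERMS:
            pos = clause.find(term)
            if pos == -1:
                continue
            prefix = clause[:pos]
            # A term counts as negated if any cue appears before it in the clause.
            negated = any(
                re.search(rf"\b{re.escape(cue)}\b", prefix) for cue in NEGATION_CUES
            )
            if not negated and term not in prompts:
                prompts.append(term)
    return prompts
```

For example, a transcript such as "No evidence of meningioma. There is a glioma in the left frontal lobe." would yield only the prompt "glioma", which could then be passed to a text-conditioned detector. A real clinical system would need a much richer cue list, scope handling, and validation on actual dictations.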
