Rad-VLSM: A Cross-Modal Framework with Semantics-Assisted Prompting for Medical Segmentation and Diagnosis
Fengyi Zhang, Xujie Zeng, Mohan Liu, Zengyi Wang, Yalong Jiang

TL;DR
Rad-VLSM is a two-stage cross-modal framework that uses semantic guidance and multitask learning to focus on lesions, segment them, and support visually grounded diagnosis in medical images.
Contribution
Introduces a two-stage, semantics-assisted pipeline: a BLIP-2-based vision-language alignment module turns lesion-related candidate regions into box prompts, and a SAM-based multitask network uses those prompts for lesion segmentation and diagnosis.
Findings
Achieves strong segmentation and diagnostic performance on breast ultrasound data.
Reduces dependence on text-to-diagnosis mapping through lesion-level grounding.
Demonstrates favorable generalization across benchmarks.
Abstract
Medical image segmentation is more clinically valuable when it supports diagnosis rather than merely producing lesion masks. However, diagnostically relevant lesion cues are often subtle and localized, while existing models may be distracted by background tissues, acoustic artifacts, and irrelevant visual correlations. To address this problem, we propose Rad-VLSM, a two-stage cross-modal framework for semantics-assisted lesion focusing, robust segmentation, and visually grounded diagnosis. In the first stage, a BLIP-2-based vision-language alignment module identifies lesion-related candidate regions under semantic guidance and converts them into box prompts. In the second stage, these prompts are fed into a SAM-based multitask network, where a multi-candidate region aggregation strategy improves prompt stability and guides lesion segmentation. The predicted masks are then used as…
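The abstract does not spell out how the multi-candidate region aggregation works, so the following is only a minimal sketch of one plausible reading: several scored candidate boxes from the first stage are fused into a single box prompt by score-weighted averaging. The function name `aggregate_boxes`, the `top_k` parameter, and the weighting rule are all assumptions made for illustration, not the authors' method.

```python
import numpy as np


def aggregate_boxes(boxes, scores, top_k=3):
    """Fuse the top-k candidate boxes into one box prompt.

    Hypothetical helper: boxes are (x0, y0, x1, y1) rows, scores are
    non-negative relevance values (e.g. from a vision-language model).
    The fused box is the score-weighted mean of the top-k candidates.
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    if boxes.ndim != 2 or boxes.shape[1] != 4:
        raise ValueError("boxes must have shape (n, 4)")
    if len(boxes) == 0 or len(boxes) != len(scores):
        raise ValueError("need one score per box and at least one box")

    # Keep the k highest-scoring candidates.
    order = np.argsort(scores)[::-1][:top_k]
    kept_scores = scores[order]
    total = kept_scores.sum()
    if total <= 0:
        raise ValueError("kept scores must sum to a positive value")

    # Normalise scores into weights and average the box coordinates.
    weights = kept_scores / total
    return weights @ boxes[order]


if __name__ == "__main__":
    candidates = [[10, 10, 50, 50], [20, 20, 60, 60], [0, 0, 100, 100]]
    relevance = [0.5, 0.5, 0.0]
    print(aggregate_boxes(candidates, relevance, top_k=2))
```

Averaging several candidates instead of trusting the single best one is a simple way to make the prompt less sensitive to one spurious region; the resulting box could then be passed to a box-promptable segmenter such as SAM.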
