VLSM-Adapter: Finetuning Vision-Language Segmentation Efficiently with Lightweight Blocks
Manish Dhakal, Rabin Adhikari, Safal Thapaliya, Bishesh Khanal

TL;DR
VLSM-Adapter is a lightweight, transformer-based adapter that enables efficient fine-tuning of vision-language segmentation models, achieving high performance with significantly fewer trainable parameters, especially beneficial for medical imaging tasks.
Contribution
We propose VLSM-Adapter, a novel adapter that fine-tunes pretrained vision-language segmentation models by training only a small number of added parameters, reducing computational cost while maintaining state-of-the-art performance.
Findings
Outperforms existing methods with only 3 million trainable parameters.
Achieves comparable results to full fine-tuning on CLIP-based segmentation models.
Reduces resource requirements for medical image segmentation tasks.
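To make the parameter budget concrete, the following is a rough sketch of how trainable parameters could be counted for a generic bottleneck adapter (one down-projection and one up-projection, each with a bias). The hidden width, bottleneck width, number of adapters, and frozen backbone size are hypothetical values chosen for illustration; they are not taken from the paper.

```python
def adapter_params(d_model: int, bottleneck: int) -> int:
    """Parameters in one bottleneck adapter: down- and up-projection with biases."""
    down = d_model * bottleneck + bottleneck   # weights + bias of the down-projection
    up = bottleneck * d_model + d_model        # weights + bias of the up-projection
    return down + up


def trainable_fraction(n_adapters: int, d_model: int, bottleneck: int,
                       frozen_params: int) -> float:
    """Share of all parameters that receive gradients when only adapters train."""
    trainable = n_adapters * adapter_params(d_model, bottleneck)
    return trainable / (trainable + frozen_params)


# Hypothetical configuration (assumed for illustration, not from the paper).
D_MODEL = 768
BOTTLENECK = 64
N_ADAPTERS = 24
FROZEN_PARAMS = 150_000_000

per_adapter = adapter_params(D_MODEL, BOTTLENECK)
total_trainable = N_ADAPTERS * per_adapter
fraction = trainable_fraction(N_ADAPTERS, D_MODEL, BOTTLENECK, FROZEN_PARAMS)

print(f"per adapter: {per_adapter:,}")
print(f"total trainable: {total_trainable:,}")
print(f"trainable share: {fraction:.2%}")
```

Under these assumed sizes the adapters account for a few million parameters, a small share of the whole model, which is the regime the findings above describe.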
Abstract
Foundation Vision-Language Models (VLMs) trained on large-scale open-domain image and text pairs have recently been adapted to develop Vision-Language Segmentation Models (VLSMs), which accept text prompts during inference to guide image segmentation. If robust and powerful VLSMs can be built for medical images, they could aid medical professionals in many clinical tasks where they must spend substantial time delineating the target structure of interest. Because annotated medical image datasets are scarce, VLSMs for medical images resort to fine-tuning a base VLM or VLSM pretrained on open-domain natural image datasets; this fine-tuning is resource-consuming and expensive, as it usually requires updating all or a significant fraction of the pretrained parameters. Recently, lightweight blocks called adapters have been proposed in VLMs that keep the pretrained model frozen and only train…
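The adapter idea mentioned in the abstract can be sketched as a small residual block placed next to frozen layers. The code below is a generic bottleneck adapter in plain Python, not the architecture of VLSM-Adapter, whose details are not given in this excerpt. The ReLU activation, the zero-initialised up-projection, and all sizes are assumptions made for illustration.

```python
import random


def linear(x, weight, bias):
    """Dense layer on a single vector: weight has one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


class BottleneckAdapter:
    """Down-project, apply a nonlinearity, up-project, then add the input back.

    Only these weights would be trained; the pretrained layers around the
    adapter stay frozen.
    """

    def __init__(self, d_model, bottleneck, seed=0):
        rng = random.Random(seed)
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                       for _ in range(bottleneck)]
        self.b_down = [0.0] * bottleneck
        # Assumption: zero-initialised up-projection, so the adapter starts
        # as an identity mapping and does not disturb the frozen model.
        self.w_up = [[0.0] * bottleneck for _ in range(d_model)]
        self.b_up = [0.0] * d_model

    def __call__(self, h):
        z = relu(linear(h, self.w_down, self.b_down))
        delta = linear(z, self.w_up, self.b_up)
        return [hi + di for hi, di in zip(h, delta)]  # residual connection


adapter = BottleneckAdapter(d_model=8, bottleneck=2)
hidden = [0.5] * 8            # stand-in for one frozen-layer activation
output = adapter(hidden)
print(output == hidden)       # identity at initialisation under the zero-init assumption
```

Starting from an identity mapping is one common design choice for adapters: training begins from the behaviour of the pretrained model and only gradually departs from it as the small set of new weights is updated.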
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques
Methods: Transformer · Adapter
