VLSM-Adapter: Finetuning Vision-Language Segmentation Efficiently with Lightweight Blocks

Manish Dhakal; Rabin Adhikari; Safal Thapaliya; Bishesh Khanal

arXiv:2405.06196·cs.CV·June 28, 2024·1 cites

TL;DR

VLSM-Adapter is a lightweight, transformer-based adapter that enables efficient fine-tuning of vision-language segmentation models, achieving high performance with significantly fewer trainable parameters, especially beneficial for medical imaging tasks.

Contribution

We propose a novel VLSM-Adapter that allows effective fine-tuning of pretrained models with minimal parameters, reducing computational costs while maintaining state-of-the-art performance.

Findings

01. Outperforms existing methods with only 3 million trainable parameters.

02. Achieves comparable results to full fine-tuning on CLIP-based segmentation models.

03. Reduces resource requirements for medical image segmentation tasks.

Abstract

Foundation Vision-Language Models (VLMs) trained on large-scale open-domain image-text pairs have recently been adapted to develop Vision-Language Segmentation Models (VLSMs), which accept text prompts during inference to guide image segmentation. If robust and powerful VLSMs can be built for medical images, they could aid medical professionals in the many clinical tasks where they must spend substantial time delineating the target structure of interest. Because annotated medical image datasets are scarce, VLSMs for medical images resort to fine-tuning a base VLM or VLSM pretrained on open-domain natural image datasets; this fine-tuning is resource-consuming and expensive, as it usually requires updating all or a significant fraction of the pretrained parameters. Recently, lightweight blocks called adapters have been proposed in VLMs that keep the pretrained model frozen and only train…
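To make the adapter idea in the abstract concrete, here is a minimal sketch of a generic bottleneck adapter in plain Python: a small down-projection, a ReLU, an up-projection, and a residual connection around them, with only these small matrices counted as trainable. This is an assumption for illustration and may differ from the paper's actual block; the function names and all dimensions (`hidden_dim=768`, `bottleneck_dim=64`, 24 adapters, a 150M-parameter frozen backbone) are hypothetical and not taken from the paper.

```python
# Generic bottleneck adapter sketch (assumed design, not the paper's exact block).

def adapter_param_count(hidden_dim: int, bottleneck_dim: int) -> int:
    """Weights and biases of one adapter: down-projection plus up-projection."""
    down = hidden_dim * bottleneck_dim + bottleneck_dim
    up = bottleneck_dim * hidden_dim + hidden_dim
    return down + up


def trainable_fraction(frozen_params: int, hidden_dim: int,
                       bottleneck_dim: int, num_adapters: int) -> float:
    """Share of all parameters that are trained when the backbone stays frozen."""
    trainable = num_adapters * adapter_param_count(hidden_dim, bottleneck_dim)
    return trainable / (frozen_params + trainable)


def adapter_forward(h, w_down, b_down, w_up, b_up):
    """Apply h + up(relu(down(h))) to one feature vector h (a list of floats).

    w_down has bottleneck_dim rows of length hidden_dim;
    w_up has hidden_dim rows of length bottleneck_dim.
    """
    z = [max(0.0, sum(w * x for w, x in zip(row, h)) + b)
         for row, b in zip(w_down, b_down)]
    out = [sum(w * v for w, v in zip(row, z)) + b
           for row, b in zip(w_up, b_up)]
    # Residual connection: the frozen model's feature passes through unchanged,
    # and the adapter only adds a learned correction on top of it.
    return [x + o for x, o in zip(h, out)]


if __name__ == "__main__":
    # Hypothetical sizes chosen only to show the order of magnitude.
    per_adapter = adapter_param_count(hidden_dim=768, bottleneck_dim=64)
    frac = trainable_fraction(150_000_000, 768, 64, num_adapters=24)
    print(per_adapter, 24 * per_adapter, round(frac, 4))
```

One common design choice, also an assumption here, is to initialise the up-projection to zero so the adapter starts as an identity mapping and the frozen model's behaviour is preserved at the start of fine-tuning.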

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques

Methods: Transformer · Adapter