Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation
Zunnan Xu, Zhihong Chen, Yong Zhang, Yibing Song, Xiang Wan, Guanbin Li

TL;DR
This paper introduces a parameter-efficient tuning method for referring image segmentation, using a novel adapter and lightweight decoder to achieve high performance with minimal parameter updates.
Contribution
It proposes Bridger, a new adapter for cross-modal interaction, and a lightweight decoder, enabling effective dense prediction with minimal parameter tuning.
Findings
Achieves comparable or better performance with only 1.61% to 3.38% backbone parameter updates.
Demonstrates effectiveness on challenging benchmarks.
Provides a practical approach for resource-efficient dense prediction tasks.
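To make the 1.61% to 3.38% figure concrete, the short sketch below shows how such a ratio could be computed. It assumes the percentage means newly tuned parameters divided by frozen backbone parameters; the backbone size of 150 million is a hypothetical number chosen for illustration, not a value from the paper, and the helper names are likewise illustrative.

```python
def tuned_fraction(trainable_params: int, backbone_params: int) -> float:
    """Return tuned parameters as a percentage of backbone parameters."""
    if backbone_params <= 0:
        raise ValueError("backbone_params must be positive")
    return 100.0 * trainable_params / backbone_params


def trainable_budget(backbone_params: int, percent: float) -> int:
    """Return how many parameters a given percentage of the backbone allows."""
    return round(backbone_params * percent / 100.0)


# Hypothetical backbone size (an assumption, not reported in the source).
BACKBONE_PARAMS = 150_000_000

# Parameter budgets at the two ends of the range quoted in the findings.
low_budget = trainable_budget(BACKBONE_PARAMS, 1.61)
high_budget = trainable_budget(BACKBONE_PARAMS, 3.38)

if __name__ == "__main__":
    for budget in (low_budget, high_budget):
        pct = tuned_fraction(budget, BACKBONE_PARAMS)
        print(f"{budget:,} tuned parameters = {pct:.2f}% of the backbone")
```

Under that assumed backbone size, the quoted range corresponds to roughly 2.4 to 5.1 million tuned parameters, with every other weight left frozen.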
Abstract
Parameter Efficient Tuning (PET) has gained attention for reducing the number of parameters while maintaining performance and providing better hardware resource savings, but few studies investigate dense prediction tasks and interaction between modalities. In this paper, we investigate efficient tuning problems in referring image segmentation. We propose a novel adapter called Bridger to facilitate cross-modal information exchange and inject task-specific information into the pre-trained model. We also design a lightweight decoder for image segmentation. Our approach achieves comparable or superior performance with only 1.61% to 3.38% backbone parameter updates, evaluated on challenging benchmarks. The code is available at https://github.com/kkakkkka/ETRIS.
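The idea of a small trainable module that exchanges information between two frozen encoders can be illustrated with a minimal NumPy sketch. This is not the Bridger implementation from the repository above: the class name CrossModalAdapter, the bottleneck size, the bidirectional cross-attention, and the zero-initialized up-projections are all assumptions chosen to show the general adapter pattern. Only the forward pass is shown, with no training loop.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class CrossModalAdapter:
    """Hypothetical bridging adapter (illustrative, not the paper's Bridger)."""

    def __init__(self, vis_dim, txt_dim, hidden_dim=32, seed=0):
        rng = np.random.default_rng(seed)
        # Down-projections map both modalities into one small shared space.
        self.vis_down = rng.normal(0.0, 0.02, (vis_dim, hidden_dim))
        self.txt_down = rng.normal(0.0, 0.02, (txt_dim, hidden_dim))
        # Assumption: zero-initialized up-projections, so the adapter starts
        # as an identity mapping and leaves the frozen features unchanged.
        self.vis_up = np.zeros((hidden_dim, vis_dim))
        self.txt_up = np.zeros((hidden_dim, txt_dim))

    def num_params(self):
        """Count the adapter's own (trainable) weights."""
        weights = (self.vis_down, self.txt_down, self.vis_up, self.txt_up)
        return sum(w.size for w in weights)

    def __call__(self, vis, txt):
        v = vis @ self.vis_down              # (num_patches, hidden_dim)
        t = txt @ self.txt_down              # (num_tokens, hidden_dim)
        scale = 1.0 / np.sqrt(v.shape[-1])
        # Each modality attends to the other in the shared space.
        v_from_t = softmax(v @ t.T * scale) @ t
        t_from_v = softmax(t @ v.T * scale) @ v
        # Residual injection back into each encoder's feature stream.
        vis_out = vis + v_from_t @ self.vis_up
        txt_out = txt + t_from_v @ self.txt_up
        return vis_out, txt_out


# Toy features standing in for intermediate outputs of two frozen encoders.
vis = np.random.default_rng(1).normal(size=(196, 256))   # image patches
txt = np.random.default_rng(2).normal(size=(12, 128))    # text tokens
adapter = CrossModalAdapter(vis_dim=256, txt_dim=128, hidden_dim=32)
vis_out, txt_out = adapter(vis, txt)
```

In this sketch only the four projection matrices would be trained while both encoders stay frozen, which is what keeps the tuned parameter count small relative to the backbone.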
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · COVID-19 diagnosis using AI
Methods: Adapter
