Parameter-efficient Fine-tuning in Hyperspherical Space for Open-vocabulary Semantic Segmentation

Zelin Peng; Zhengqin Xu; Zhilin Zeng; Yaoming Wang; Wei Shen

arXiv:2405.18840·cs.CV·December 3, 2024

TL;DR

This paper introduces H-CLIP, a parameter-efficient fine-tuning method in hyperspherical space for CLIP, significantly improving open-vocabulary semantic segmentation performance with minimal parameter updates.

Contribution

H-CLIP proposes a symmetrical PEFT strategy with block-diagonal matrices and a dual communication module, mitigating modality misalignment and preserving generalization in CLIP.

Findings

01 Achieves state-of-the-art results on multiple benchmarks.

02 Requires only about 4% of CLIP's parameters to be fine-tuned.

03 Effectively mitigates modality misalignment and maintains generalization.

Abstract

Open-vocabulary semantic segmentation seeks to label each pixel in an image with arbitrary text descriptions. Vision-language foundation models, especially CLIP, have recently emerged as powerful tools for acquiring open-vocabulary capabilities. However, fine-tuning CLIP to equip it with pixel-level prediction ability often suffers from three issues: 1) high computational cost, 2) misalignment between the two inherent modalities of CLIP, and 3) degraded generalization ability on unseen categories. To address these issues, we propose H-CLIP, a symmetrical parameter-efficient fine-tuning (PEFT) strategy conducted in hyperspherical space for both CLIP modalities. Specifically, the PEFT strategy is achieved by a series of efficient block-diagonal learnable transformation matrices and a dual cross-relation communication module among all learnable matrices. Since the PEFT strategy is…
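To make the idea of a block-diagonal learnable transformation in hyperspherical space more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the matrices are parameterized, so the Cayley transform used here to keep each block orthogonal is an assumption, the names `cayley`, `block_diagonal_orthogonal`, and `pairwise_cosines` are hypothetical, and the dual cross-relation communication module is omitted entirely. The sketch only shows that multiplying a frozen weight matrix by a block-diagonal orthogonal matrix changes the weights while leaving the pairwise angles between its columns unchanged, and that the block structure needs fewer trainable values than a dense matrix.

```python
import numpy as np


def cayley(raw):
    """Map an unconstrained square matrix to an orthogonal one (assumed parameterization)."""
    skew = raw - raw.T                      # skew-symmetric: skew.T == -skew
    eye = np.eye(skew.shape[0])
    # (I + S)^{-1} (I - S) is orthogonal whenever S is skew-symmetric.
    return np.linalg.solve(eye + skew, eye - skew)


def block_diagonal_orthogonal(params):
    """Assemble a (n*b, n*b) block-diagonal orthogonal matrix from params of shape (n, b, b)."""
    n, b, _ = params.shape
    out = np.zeros((n * b, n * b))
    for i, raw in enumerate(params):
        out[i * b:(i + 1) * b, i * b:(i + 1) * b] = cayley(raw)
    return out


def pairwise_cosines(w):
    """Cosine similarity between every pair of columns of w."""
    cols = w / np.linalg.norm(w, axis=0, keepdims=True)
    return cols.T @ cols


rng = np.random.default_rng(0)
d, k, b = 8, 5, 4                            # toy sizes, chosen only for illustration
w_frozen = rng.normal(size=(d, k))           # stands in for a frozen pretrained weight
params = 0.1 * rng.normal(size=(d // b, b, b))  # the only trainable values in this sketch

r = block_diagonal_orthogonal(params)
w_tuned = r @ w_frozen                       # transformed weight used at inference

# Column norms and pairwise angles are preserved, so the weights move
# on the hypersphere without changing their relative geometry.
angles_preserved = np.allclose(pairwise_cosines(w_frozen), pairwise_cosines(w_tuned))
num_trainable = params.size                  # (d / b) * b * b values
num_dense = d * d                            # a dense d x d transform would need this many
```

In this toy setting the block-diagonal form stores `(d / b) * b * b` raw values instead of `d * d`, which is where the parameter savings of a block structure come from; the actual ratio reported for H-CLIP depends on details not given in the abstract.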

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Contrastive Language-Image Pre-training