Parameter-efficient Fine-tuning in Hyperspherical Space for Open-vocabulary Semantic Segmentation
Zelin Peng, Zhengqin Xu, Zhilin Zeng, Yaoming Wang, Wei Shen

TL;DR
This paper introduces H-CLIP, a parameter-efficient fine-tuning method in hyperspherical space for CLIP, significantly improving open-vocabulary semantic segmentation performance with minimal parameter updates.
Contribution
The paper proposes H-CLIP, a symmetrical PEFT strategy built from block-diagonal transformation matrices and a dual cross-relation communication module, which mitigates modality misalignment and preserves CLIP's generalization.
Findings
Achieves state-of-the-art results on multiple benchmarks.
Requires only about 4% of CLIP's parameters to be fine-tuned.
Effectively mitigates modality misalignment and maintains generalization.
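To give a feel for why block-diagonal matrices keep the trainable budget small, the sketch below compares the parameter count of a dense d x d transformation with a block-diagonal one of the same size. The width of 768 and the block counts are illustrative assumptions, not values taken from the paper, and the resulting ratios are not meant to reproduce the reported 4% figure.

```python
def dense_params(d: int) -> int:
    """Number of entries in a dense d x d transformation matrix."""
    return d * d


def block_diagonal_params(d: int, num_blocks: int) -> int:
    """Number of entries in a d x d block-diagonal matrix with equal blocks."""
    if d % num_blocks != 0:
        raise ValueError("d must be divisible by num_blocks")
    block_size = d // num_blocks
    return num_blocks * block_size * block_size


# Hypothetical layer width and block counts, chosen only for illustration.
d = 768
for num_blocks in (1, 4, 16):
    ratio = block_diagonal_params(d, num_blocks) / dense_params(d)
    print(f"blocks={num_blocks:2d}  params={block_diagonal_params(d, num_blocks):6d}  ratio={ratio:.4f}")
```

With equal-sized blocks, the count shrinks by a factor equal to the number of blocks, so the block count is the main knob trading capacity for efficiency.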
Abstract
Open-vocabulary semantic segmentation seeks to label each pixel in an image with arbitrary text descriptions. Vision-language foundation models, especially CLIP, have recently emerged as powerful tools for acquiring open-vocabulary capabilities. However, fine-tuning CLIP to equip it with pixel-level prediction ability often suffers from three issues: 1) high computational cost, 2) misalignment between the two inherent modalities of CLIP, and 3) degraded generalization ability on unseen categories. To address these issues, we propose H-CLIP, a symmetrical parameter-efficient fine-tuning (PEFT) strategy conducted in hyperspherical space for both of the two CLIP modalities. Specifically, the PEFT strategy is achieved by a series of efficient block-diagonal learnable transformation matrices and a dual cross-relation communication module among all learnable matrices. Since the PEFT strategy is…
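The idea of a block-diagonal learnable transformation in hyperspherical space can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each block is constrained to be orthogonal (here via the Cayley transform, a common choice in orthogonal fine-tuning), so that multiplying a frozen weight matrix by the transformation preserves the pairwise angles between its columns. The names `cayley_orthogonal` and `block_diagonal`, the sizes, and the random stand-in weight are all hypothetical, and the dual cross-relation communication module is omitted.

```python
import numpy as np


def cayley_orthogonal(a: np.ndarray) -> np.ndarray:
    """Map an arbitrary square matrix to an orthogonal one (Cayley transform)."""
    s = a - a.T  # skew-symmetric part, so (I - s) is always invertible
    eye = np.eye(a.shape[0])
    return (eye + s) @ np.linalg.inv(eye - s)


def block_diagonal(blocks: list) -> np.ndarray:
    """Place square blocks along the diagonal of a larger zero matrix."""
    d = sum(b.shape[0] for b in blocks)
    out = np.zeros((d, d))
    start = 0
    for b in blocks:
        k = b.shape[0]
        out[start:start + k, start:start + k] = b
        start += k
    return out


rng = np.random.default_rng(0)
d, num_blocks = 8, 4                      # hypothetical sizes
block_size = d // num_blocks

# The small unconstrained matrices play the role of the learnable parameters.
raw = [0.1 * rng.standard_normal((block_size, block_size)) for _ in range(num_blocks)]
R = block_diagonal([cayley_orthogonal(a) for a in raw])

W = rng.standard_normal((d, 5))           # stand-in for a frozen pretrained weight
W_tuned = R @ W                           # transformed weight; W itself is not updated

# Orthogonality of R keeps the Gram matrix, hence norms and angles, unchanged.
print(np.allclose(R.T @ R, np.eye(d)))
print(np.allclose(W_tuned.T @ W_tuned, W.T @ W))
```

Because only the small per-block matrices are trained while `W` stays frozen, the number of updated parameters scales with the block size rather than with the full layer width.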
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Contrastive Language-Image Pre-training
