VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

Longtian Qiu; Renrui Zhang; Ziyu Guo; Ziyao Zeng; Zilu Guo; Yafeng Li; Guangnan Zhang

arXiv:2112.02399·cs.CV·August 11, 2023·28 cites

Open Access

TL;DR

VT-CLIP introduces a method to improve CLIP's cross-modal alignment by guiding textual features with visual information, enhancing transfer performance especially in few-shot classification tasks.

Contribution

The paper proposes VT-CLIP, a novel approach that makes textual features visually guided to better align with images, addressing semantic gaps in CLIP.

Findings

01. VT-CLIP outperforms baseline CLIP on 11 classification datasets.

02. It significantly improves few-shot classification performance.

03. Visually guided text features improve category-wise matching accuracy.

Abstract

Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for its transferable visual representation learning. However, due to the semantic gap within datasets, CLIP's pre-trained image-text alignment becomes sub-optimal on downstream tasks, which severely harms its transfer performance. To better adapt the cross-modality embedding space, we propose to enhance CLIP via Visual-guided Texts, named VT-CLIP. Specifically, we guide textual features of different categories to adaptively explore informative regions on the image and aggregate visual features by attention mechanisms. In this way, the texts become visual-guided, namely, more semantically correlated with downstream images, which greatly benefits the category-wise matching process. In few-shot settings, we evaluate our VT-CLIP on 11 well-known classification datasets to demonstrate its effectiveness.
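The mechanism described in the abstract (category text features attending over image regions and aggregating visual features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head attention, the residual update, the projection matrices `w_q`, `w_k`, `w_v`, the `temperature` value, and the function name `visual_guided_logits` are all assumptions made for illustration, and random arrays stand in for CLIP features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def visual_guided_logits(text_feats, patch_feats, global_feat,
                         w_q, w_k, w_v, temperature=0.01):
    """Hypothetical single-head cross-attention from texts to image regions.

    text_feats:  (K, D) one text feature per category
    patch_feats: (N, D) spatial (region) features of one image
    global_feat: (D,)   global feature of the same image
    """
    d = w_q.shape[1]
    q = text_feats @ w_q                            # (K, d) queries from texts
    k = patch_feats @ w_k                           # (N, d) keys from regions
    v = patch_feats @ w_v                           # (N, D) values from regions
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)   # (K, N) weights over regions
    guided = text_feats + attn @ v                  # residual update (assumed)
    # Category-wise matching: cosine similarity with the global image feature.
    logits = l2_normalize(guided) @ l2_normalize(global_feat) / temperature
    return logits, attn

# Random stand-ins for CLIP features (illustrative sizes only).
rng = np.random.default_rng(0)
K, N, D, d = 5, 49, 64, 32
text_feats = rng.normal(size=(K, D))
patch_feats = rng.normal(size=(N, D))
global_feat = patch_feats.mean(axis=0)
w_q = rng.normal(size=(D, d)) / np.sqrt(D)
w_k = rng.normal(size=(D, d)) / np.sqrt(D)
w_v = rng.normal(size=(D, D)) / np.sqrt(D)

logits, attn = visual_guided_logits(text_feats, patch_feats, global_feat,
                                    w_q, w_k, w_v)
print(logits.shape, attn.shape)  # (5,) (5, 49)
```

Each row of `attn` is a distribution over image regions for one category, so different categories can focus on different parts of the image before matching. In a few-shot setting, the projection weights would presumably be the trainable part while the CLIP encoders stay fixed, but that training setup is an assumption here rather than something stated in the abstract.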

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training