CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding

Linhui Xiao; Xiaoshan Yang; Fang Peng; Ming Yan; Yaowei Wang; Changsheng Xu

arXiv:2305.08685·cs.CV·November 20, 2024·1 cites

TL;DR

CLIP-VG introduces a self-paced curriculum approach that adapts CLIP for unsupervised visual grounding, significantly improving performance by progressively refining pseudo-labels and outperforming existing methods.

Contribution

The paper proposes a novel self-paced curriculum adapting algorithm for CLIP in visual grounding, enhancing pseudo-label reliability and diversity, and achieving state-of-the-art results.

Findings

01

Outperforms current state-of-the-art unsupervised methods on RefCOCO datasets.

02

Achieves improvements of 6.78% to 14.87% over existing unsupervised methods.

03

Competitive results in fully supervised settings.

Abstract

Visual Grounding (VG) is a crucial topic in the field of vision and language, which involves locating a specific region described by expressions within an image. To reduce the reliance on manually labeled data, unsupervised visual grounding methods have been developed to locate regions using pseudo-labels. However, the performance of existing unsupervised methods is highly dependent on the quality of pseudo-labels, and these methods always encounter issues with limited diversity. In order to utilize vision and language pre-trained models to address the grounding problem, and reasonably take advantage of pseudo-labels, we propose CLIP-VG, a novel method that can conduct self-paced curriculum adapting of CLIP with pseudo-language labels. We propose a simple yet efficient end-to-end network architecture to realize the transfer of CLIP to visual grounding. Based on the CLIP-based architecture,…
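The abstract is cut off before it describes the algorithm, so the following is only a generic sketch of what a self-paced curriculum over pseudo-labels can look like, not the authors' implementation. It assumes that reliability is scored as the IoU between a model's predicted box and the pseudo-label box, and that the caller supplies the threshold schedule. The names `iou`, `select_reliable`, `self_paced_curriculum`, the `train` callable, and the `"box"` key are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Predictor = Callable[[Dict], Box]        # maps a sample to a predicted box


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_reliable(pseudo: Sequence[Dict], predict: Predictor,
                    threshold: float) -> List[Dict]:
    """Keep pseudo-labeled samples whose box agrees with the model's prediction.

    Assumption: agreement (IoU) stands in for pseudo-label reliability.
    """
    return [s for s in pseudo if iou(predict(s), s["box"]) >= threshold]


def self_paced_curriculum(pseudo: Sequence[Dict],
                          train: Callable[[Sequence[Dict]], Predictor],
                          thresholds: Sequence[float]):
    """Alternate between selecting a subset and retraining on it.

    `train` is a hypothetical callable that fits a model on the given samples
    and returns its predictor. Returns the final predictor and the subset size
    chosen in each round.
    """
    model = train(pseudo)  # warm-up round on every pseudo-label
    subset_sizes = []
    for t in thresholds:
        subset = select_reliable(pseudo, model, t)
        subset_sizes.append(len(subset))
        if subset:  # skip retraining when nothing passes the threshold
            model = train(subset)
    return model, subset_sizes
```

In this sketch the schedule is entirely up to the caller: passing thresholds from strict to loose admits the most trusted pseudo-labels first and widens the training set in later rounds.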

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Language-Image Pre-training