GOPro: Generate and Optimize Prompts in CLIP using Self-Supervised Learning
Mainak Singha, Ankit Jha, Biplab Banerjee

TL;DR
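GOPro introduces a unified prompt learning framework that combines CLIP and self-supervised learning to improve domain generalization in visual recognition tasks. It targets the loss-weighting and view-inconsistency problems that arise when CLIP's contrastive loss and an SSL loss are trained together in a multi-task setup.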
Contribution
The paper proposes a prompt learning model that works in a shared image-text embedding space and combines multiple loss functions to enhance invariance and generalizability in CLIP-based models.
Findings
Outperforms state-of-the-art prompting methods on multiple benchmarks
Demonstrates significant improvements in domain generalization tasks
Effectively combines CLIP with self-supervised learning for robust visual recognition
Abstract
Large-scale foundation models, such as CLIP, have demonstrated remarkable success in visual recognition tasks by embedding images in a semantically rich space. Self-supervised learning (SSL) has also shown promise in improving visual recognition by learning invariant features. However, the combination of CLIP with SSL is found to face challenges due to the multi-task framework that blends CLIP's contrastive loss and SSL's loss, including difficulties with loss weighting and inconsistency among different views of images in CLIP's output space. To overcome these challenges, we propose a prompt learning-based model called GOPro, which is a unified framework that ensures similarity between various augmented views of input images in a shared image-text embedding space, using a pair of learnable image and text projectors atop CLIP, to promote invariance and generalizability. To automatically…
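The idea of keeping augmented views of the same image close together in a shared image-text space can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes linear projectors, a cosine-distance consistency term, and a softmax contrastive term over class text embeddings, and it uses random vectors as stand-ins for frozen CLIP features. All names (`W_img`, `W_txt`, `view_consistency_loss`, `contrastive_loss`) are hypothetical.

```python
import math
import random

def matvec(W, x):
    """Apply a linear projector (list of rows) to a feature vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    n = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / n for a in v]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

def view_consistency_loss(W_img, feat_a, feat_b):
    """Assumed SSL-style term: 1 - cosine similarity of two projected views."""
    return 1.0 - cosine(matvec(W_img, feat_a), matvec(W_img, feat_b))

def contrastive_loss(W_img, W_txt, img_feat, text_feats, label, temperature=0.07):
    """Assumed CLIP-style term: cross-entropy over image-to-text similarities."""
    z = matvec(W_img, img_feat)
    logits = [cosine(z, matvec(W_txt, t)) / temperature for t in text_feats]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[label]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
dim_in, dim_out, n_classes = 8, 4, 3

# Hypothetical learnable projectors placed on top of frozen features.
W_img = random_matrix(dim_out, dim_in, rng)
W_txt = random_matrix(dim_out, dim_in, rng)

# Stand-ins for CLIP features: one image under two augmentations, plus class texts.
view_a = [rng.gauss(0, 1) for _ in range(dim_in)]
view_b = [a + 0.05 * rng.gauss(0, 1) for a in view_a]
text_feats = [[rng.gauss(0, 1) for _ in range(dim_in)] for _ in range(n_classes)]

consistency = view_consistency_loss(W_img, view_a, view_b)
contrastive = contrastive_loss(W_img, W_txt, view_a, text_feats, label=0)
print("view consistency loss:", consistency)
print("contrastive loss:", contrastive)
```

In this sketch both terms are computed on projected features in the same space, so a training loop would update the projectors (and prompts) against both objectives at once. How GOPro actually weights or combines its losses is not specified in the text above.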
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Image Processing Techniques and Applications · Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training
