Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts

Mayug Maniparambil; Chris Vorster; Derek Molloy; Noel Murphy; Kevin McGuinness; Noel E. O'Connor

arXiv:2307.11661 · cs.CV · August 9, 2023

TL;DR

This paper demonstrates how GPT-4 can generate visually descriptive prompts to significantly improve CLIP's zero-shot and few-shot performance on specialized visual datasets, reducing the need for manual prompt engineering.

Contribution

The authors introduce a method using GPT-4 to generate descriptive prompts that enhance CLIP's adaptation to downstream tasks, outperforming existing prompt engineering and adapter methods.
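As a rough illustration of how several descriptive prompts per class can feed a CLIP-style zero-shot classifier, the sketch below averages the embeddings of each class's descriptions into one prototype and picks the class most similar to the image embedding. This is standard prompt ensembling; whether it matches the authors' exact procedure is an assumption. The 3-dimensional vectors are made-up stand-ins for real CLIP text and image embeddings, and the function names and `temperature` value are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def class_prototype(description_embeddings):
    """Average the unit-length embeddings of one class's descriptions."""
    unit = [normalize(e) for e in description_embeddings]
    dim = len(unit[0])
    mean = [sum(e[i] for e in unit) / len(unit) for i in range(dim)]
    return normalize(mean)


def zero_shot_predict(image_embedding, class_to_embeddings, temperature=0.01):
    """Return the best class and a softmax over cosine similarities."""
    img = normalize(image_embedding)
    names = list(class_to_embeddings)
    logits = []
    for name in names:
        proto = class_prototype(class_to_embeddings[name])
        cosine = sum(a * b for a, b in zip(img, proto))
        logits.append(cosine / temperature)
    # Numerically stable softmax: subtract the max logit before exponentiating.
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    probs = {name: e / total for name, e in zip(names, exps)}
    return max(probs, key=probs.get), probs


# Toy stand-ins (assumption): in practice these would come from CLIP's text
# encoder applied to GPT-4-written descriptions, and from its image encoder.
toy_text_embeddings = {
    "forest": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "river": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
toy_image_embedding = [0.85, 0.15, 0.05]

label, probs = zero_shot_predict(toy_image_embedding, toy_text_embeddings)
print(label, probs)
```

With real embeddings, only the two toy inputs would change; the averaging and similarity steps stay the same.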

Findings

01 Zero-shot transfer accuracy improves on EuroSAT (~7%), DTD (~7%), SUN397 (~4.6%), and CUB (~3.3%).

02 A simple few-shot adapter outperforms CoCoOp by ~2% on average.

03 GPT-4-generated visually descriptive prompts yield these gains with less manual prompt engineering.

Abstract

Contrastive pretrained large Vision-Language Models (VLMs) like CLIP have revolutionized visual representation learning by providing good performance on downstream datasets. VLMs are 0-shot adapted to a downstream dataset by designing prompts that are relevant to the dataset. Such prompt engineering makes use of domain expertise and a validation dataset. Meanwhile, recent developments in generative pretrained models like GPT-4 mean they can be used as advanced internet search tools. They can also be manipulated to provide visual information in any structure. In this work, we show that GPT-4 can be used to generate text that is visually descriptive and how this can be used to adapt CLIP to downstream tasks. We show considerable improvements in 0-shot transfer accuracy on specialized fine-grained datasets like EuroSAT (~7%), DTD (~7%), SUN397 (~4.6%), and CUB (~3.3%) when compared to…

Code & Models

Repositories

mayug/vdt-adapter
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Attention Is All You Need · Adam · Label Smoothing · Layer Normalization · Absolute Position Encodings · Linear Layer · Softmax · Dense Connections · Multi-Head Attention · Dropout