Leveraging Vision-Language Foundation Models for Fine-Grained Downstream Tasks

Denis Coquenet; Clément Rambour; Emanuele Dalsasso; Nicolas Thome

arXiv:2307.06795·cs.CV·July 14, 2023

Open Access · 1 Repo

TL;DR

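This paper introduces a multitask fine-tuning strategy for vision-language foundation models such as CLIP, based on a positive/negative prompt formulation. With CLIP as the baseline, it reports strong improvements on fine-grained bird attribute detection and localization, along with higher classification performance on CUB200-2011.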

Contribution

The paper proposes a multitask fine-tuning strategy, built on a positive/negative prompt formulation, that extends vision-language foundation models to fine-grained downstream tasks such as attribute detection and localization.

Findings

01. Improved fine-grained attribute detection and localization on bird datasets.

02. Enhanced classification accuracy on CUB200-2011.

03. Source code provided for reproducibility.

Abstract

Vision-language foundation models such as CLIP have shown impressive zero-shot performance on many tasks and datasets, especially thanks to their free-text inputs. However, they struggle to handle some downstream tasks, such as fine-grained attribute detection and localization. In this paper, we propose a multitask fine-tuning strategy based on a positive/negative prompt formulation to further leverage the capacities of the vision-language foundation models. Using the CLIP architecture as baseline, we show strong improvements on bird fine-grained attribute detection and localization tasks, while also increasing the classification performance on the CUB200-2011 dataset. We provide source code for reproducibility purposes: it is available at https://github.com/FactoDeepLearning/MultitaskVLFM.
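
To make the positive/negative prompt idea more concrete, here is a minimal sketch of how an attribute score could be read from a pair of prompts. It is not the authors' implementation: the prompt templates, the function names (`build_prompts`, `attribute_probability`), the temperature value, and the two-way softmax over cosine similarities are all assumptions made for illustration. The embeddings are toy vectors standing in for the outputs of CLIP's image and text encoders.

```python
import math


def build_prompts(attribute):
    """Return a (positive, negative) prompt pair for one attribute.

    The templates are hypothetical; the abstract does not give the exact wording.
    """
    positive = f"a photo of a bird with {attribute}"
    negative = f"a photo of a bird without {attribute}"
    return positive, negative


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def attribute_probability(image_emb, pos_emb, neg_emb, temperature=0.07):
    """Probability that the attribute is present (assumed scoring rule).

    A softmax over the two temperature-scaled similarities, which reduces to
    a sigmoid of their difference.
    """
    s_pos = cosine(image_emb, pos_emb) / temperature
    s_neg = cosine(image_emb, neg_emb) / temperature
    return 1.0 / (1.0 + math.exp(s_neg - s_pos))


# Toy embeddings standing in for encoder outputs (assumption, not real CLIP features).
image_emb = [1.0, 0.0]
pos_emb = [1.0, 0.0]   # embedding of the positive prompt
neg_emb = [0.0, 1.0]   # embedding of the negative prompt

pos_prompt, neg_prompt = build_prompts("a red crown")
prob = attribute_probability(image_emb, pos_emb, neg_emb)
```

Under this assumed rule, an image whose embedding is closer to the positive prompt than to the negative one gets a probability above 0.5, and an image equally close to both gets exactly 0.5. In a multitask setting, such per-attribute probabilities could feed a detection loss alongside a classification loss, but the actual loss design is described in the paper and its repository, not here.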

Code & Models

Repositories

FactoDeepLearning/MultitaskVLFM (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Contrastive Language-Image Pre-training