Leveraging Vision-Language Foundation Models for Fine-Grained Downstream Tasks
Denis Coquenet, Clément Rambour, Emanuele Dalsasso, Nicolas Thome

TL;DR
This paper introduces a multitask fine-tuning approach for vision-language models like CLIP, significantly improving their performance on fine-grained attribute detection, localization, and classification tasks.
Contribution
It proposes a positive/negative prompt-based fine-tuning strategy that enhances vision-language models' capabilities for detailed downstream tasks.
Findings
Improved fine-grained attribute detection and localization on bird datasets.
Enhanced classification accuracy on CUB200-2011.
Source code provided for reproducibility.
Abstract
Vision-language foundation models such as CLIP have shown impressive zero-shot performance on many tasks and datasets, especially thanks to their free-text inputs. However, they struggle to handle some downstream tasks, such as fine-grained attribute detection and localization. In this paper, we propose a multitask fine-tuning strategy based on a positive/negative prompt formulation to further leverage the capacities of the vision-language foundation models. Using the CLIP architecture as baseline, we show strong improvements on bird fine-grained attribute detection and localization tasks, while also increasing the classification performance on the CUB200-2011 dataset. We provide source code for reproducibility purposes: it is available at https://github.com/FactoDeepLearning/MultitaskVLFM.
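The positive/negative prompt idea from the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). It assumes each attribute is paired with one positive and one negative text prompt, and that an attribute score is a two-way softmax over the image-to-prompt cosine similarities. The prompt templates, the function names, the temperature of 0.07, and the toy 2-D embeddings (standing in for real CLIP image and text features) are all hypothetical choices made for illustration.

```python
import math


def make_prompts(attribute):
    """Build a (positive, negative) prompt pair; the templates are hypothetical."""
    positive = f"a photo of a bird with {attribute}"
    negative = f"a photo of a bird without {attribute}"
    return positive, negative


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def attribute_probability(image_emb, pos_emb, neg_emb, temperature=0.07):
    """Probability that the attribute is present.

    A softmax over the two scaled similarities, which reduces to a sigmoid
    of their difference. The temperature is an assumed value.
    """
    s_pos = cosine(image_emb, pos_emb) / temperature
    s_neg = cosine(image_emb, neg_emb) / temperature
    return 1.0 / (1.0 + math.exp(s_neg - s_pos))


# Toy embeddings standing in for encoder outputs (assumption, not real CLIP features).
image_emb = [1.0, 0.0]
pos_emb = [1.0, 0.0]   # would come from encoding the positive prompt
neg_emb = [0.0, 1.0]   # would come from encoding the negative prompt

pos_prompt, neg_prompt = make_prompts("a red crown")
p_present = attribute_probability(image_emb, pos_emb, neg_emb)
print(pos_prompt, "|", neg_prompt, "|", round(p_present, 4))
```

In a real setting the three embeddings would come from the image and text encoders of a CLIP-like model, and a binary cross-entropy loss on this probability could serve as one of the multitask objectives; that pairing is likewise an assumption here rather than something the abstract states.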
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Contrastive Language-Image Pre-training
