Distilling Knowledge from Text-to-Image Generative Models Improves Visio-Linguistic Reasoning in CLIP

Samyadeep Basu; Shell Xu Hu; Maziar Sanjabi; Daniela Massiceti; Soheil Feizi

arXiv:2307.09233 · cs.CV · July 2, 2024 · 2 cites

TL;DR

This paper introduces SDS-CLIP, a distillation method that fine-tunes CLIP with an objective borrowed from text-to-image generative models to strengthen its compositional visio-linguistic reasoning, improving performance by up to 7% on Winoground and up to 3% on ARO.

Contribution

The paper proposes a novel distillation approach from generative models to improve CLIP's compositional visio-linguistic reasoning capabilities.

Findings

1. Up to 7% improvement on the Winoground benchmark

2. Up to 3% improvement on the ARO dataset

3. Demonstrates the effectiveness of generative model distillation for contrastive models

Abstract

Image-text contrastive models like CLIP have wide applications in zero-shot classification, image-text retrieval, and transfer learning. However, they often struggle on compositional visio-linguistic tasks (e.g., attribute-binding or object-relationships) where their performance is no better than random chance. To address this, we introduce SDS-CLIP, a lightweight and sample-efficient distillation method to enhance CLIP's compositional visio-linguistic reasoning. Our approach fine-tunes CLIP using a distillation objective borrowed from large text-to-image generative models like Stable-Diffusion, which are known for their strong visio-linguistic reasoning abilities. On the challenging Winoground benchmark, SDS-CLIP improves the visio-linguistic performance of various CLIP models by up to 7%, while on the ARO dataset, it boosts performance by up to 3%. This work underscores the potential…
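The abstract's comparison against random chance on Winoground can be made concrete with a small scoring sketch. It assumes the standard Winoground setup, where each example pairs two captions with two images and a model supplies a similarity score for every caption-image combination. The function names `winoground_scores` and `aggregate` and the similarity values are hypothetical illustrations, not taken from the paper or its code. Under this scheme, uniformly random scores pass the text and image checks about 25% of the time each and the group check about 16.7% of the time.

```python
from typing import Dict, List, Sequence


def winoground_scores(sim: Sequence[Sequence[float]]) -> Dict[str, bool]:
    """Score one Winoground-style example.

    sim[i][j] is the model's similarity between caption i and image j,
    where caption 0 matches image 0 and caption 1 matches image 1.
    """
    c0_i0, c0_i1 = sim[0]
    c1_i0, c1_i1 = sim[1]
    # Text check: for each image, the matching caption must score higher.
    text_ok = c0_i0 > c1_i0 and c1_i1 > c0_i1
    # Image check: for each caption, the matching image must score higher.
    image_ok = c0_i0 > c0_i1 and c1_i1 > c1_i0
    # Group check: both of the above must hold.
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}


def aggregate(examples: List[Sequence[Sequence[float]]]) -> Dict[str, float]:
    """Return the fraction of examples that pass each check."""
    totals = {"text": 0, "image": 0, "group": 0}
    for sim in examples:
        for name, passed in winoground_scores(sim).items():
            totals[name] += int(passed)
    return {name: count / len(examples) for name, count in totals.items()}


if __name__ == "__main__":
    # Made-up similarity values, for illustration only.
    examples = [
        [[0.9, 0.2], [0.3, 0.8]],  # both captions and images correctly matched
        [[0.6, 0.5], [0.4, 0.3]],  # model prefers caption 0 and image 0 throughout
    ]
    print(aggregate(examples))
```

In practice the similarity values would come from a CLIP model's image-text scores; the sketch only shows how the three per-example checks combine into benchmark accuracies.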

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Language-Image Pre-training