DIME-FM: DIstilling Multimodal and Efficient Foundation Models
Ximeng Sun, Pengchuan Zhang, Peizhao Zhang, Hardik Shah, Kate Saenko, Xide Xia

TL;DR
This paper presents DIME-FM, a distillation method that transfers knowledge from large vision-language models such as CLIP to smaller, more practical models using a relatively small amount of unpaired images and sentences. The distilled ViT-B/32 model performs comparably to CLIP-ViT-B/32.
Contribution
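The paper introduces a distillation mechanism that transfers the knowledge of a large Vision-Language Foundation Model (VLFM) to a smaller, customized model. It relies on inexpensive, unpaired public images and sentences, which lowers the data and resource requirements for building a small VLFM.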
Findings
Distill-ViT-B/32 rivals CLIP-ViT-B/32 in performance.
Distill-ViT-B/32 achieves similar zero-shot and linear-probing results on ImageNet and ELEVATER.
Distill-ViT-B/32 shows comparable robustness on datasets with natural distribution shifts.
Abstract
Large Vision-Language Foundation Models (VLFMs), such as CLIP, ALIGN and Florence, are trained on large-scale datasets of image-caption pairs and achieve superior transferability and robustness on downstream tasks, but they are difficult to use in many practical applications due to their large size, high latency and fixed architectures. Unfortunately, recent work shows that training a small custom VLFM for resource-limited applications is currently very difficult using public and smaller-scale data. In this paper, we introduce a new distillation mechanism (DIME-FM) that allows us to transfer the knowledge contained in large VLFMs to smaller, customized foundation models using a relatively small amount of inexpensive, unpaired images and sentences. We transfer the knowledge from the pre-trained CLIP-ViT-L/14 model to a ViT-B/32 model, with only 40M public images and 28.4M unpaired public…
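To make the idea of distilling with unpaired images and sentences concrete, here is a minimal NumPy sketch of one generic way to do it: match the student's image-to-text similarity distribution to the teacher's. This is an illustration under stated assumptions, not the loss used in the paper. The function name `score_distillation_loss`, the temperature value, and the choice of KL divergence are all hypothetical, and the embeddings are assumed to be precomputed by the teacher and student encoders.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def score_distillation_loss(t_img, t_txt, s_img, s_txt, temperature=0.07):
    """KL(teacher || student) over image-to-text similarity distributions.

    t_img, t_txt: teacher embeddings, shapes (n_images, d_t) and (n_texts, d_t).
    s_img, s_txt: student embeddings, shapes (n_images, d_s) and (n_texts, d_s).
    The images and texts need not be paired: every image is scored against
    every sentence. The temperature is an assumed value, not from the paper.
    """
    # Similarity matrices have shape (n_images, n_texts) for both models,
    # so teacher and student may use different embedding dimensions.
    t_logits = l2_normalize(t_img) @ l2_normalize(t_txt).T / temperature
    s_logits = l2_normalize(s_img) @ l2_normalize(s_txt).T / temperature

    t_logp = log_softmax(t_logits)
    s_logp = log_softmax(s_logits)

    # Per-image KL divergence, then averaged over the batch of images.
    kl = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1)
    return float(kl.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher_images = rng.normal(size=(4, 8))
    teacher_texts = rng.normal(size=(5, 8))
    student_images = rng.normal(size=(4, 6))
    student_texts = rng.normal(size=(5, 6))
    loss = score_distillation_loss(
        teacher_images, teacher_texts, student_images, student_texts
    )
    print(f"distillation loss: {loss:.4f}")
```

Because the loss compares similarity matrices rather than raw embeddings, no image needs a matching caption, which is the property the abstract emphasizes. In a real training loop the student embeddings would come from a trainable network and the loss would be minimized with a framework that supports automatic differentiation.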
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Florence · ALIGN · Contrastive Language-Image Pre-training
