MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

Pavan Kumar Anasosalu Vasu; Hadi Pouransari; Fartash Faghri; Raviteja Vemulapalli; Oncel Tuzel

arXiv:2311.17049 · cs.CV · April 2, 2024 · 5 cites

TL;DR

MobileCLIP is a family of efficient image-text models optimized for mobile deployment. It is trained with multi-modal reinforced training, which improves accuracy without adding train-time compute overhead, and it sets a new state-of-the-art latency-accuracy tradeoff.

Contribution

The paper presents MobileCLIP, a new family of efficient image-text models, together with a training approach that transfers knowledge from an image captioning model and an ensemble of strong CLIP encoders to improve the accuracy of the efficient models.

Findings

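01. MobileCLIP-S2 is 2.3× faster while more accurate than the previous best CLIP model based on ViT-B/16.

02. A CLIP model with a ViT-B/16 image backbone trained with this approach achieves a +2.9% average performance improvement on 38 evaluation benchmarks compared to the previous best.

03. Multi-modal reinforced training achieves 10×-1000× improved learning efficiency compared to non-reinforced CLIP training.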

Abstract

Contrastive pretraining of image-text foundation models, such as CLIP, demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream tasks. However, these models utilize large transformer-based encoders with significant memory and latency overhead which pose challenges for deployment on mobile devices. In this work, we introduce MobileCLIP -- a new family of efficient image-text models optimized for runtime performance along with a novel and efficient training approach, namely multi-modal reinforced training. The proposed training approach leverages knowledge transfer from an image captioning model and an ensemble of strong CLIP encoders to improve the accuracy of efficient models. Our approach avoids train-time compute overhead by storing the additional knowledge in a reinforced dataset. MobileCLIP sets a new state-of-the-art latency-accuracy…
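As a rough illustration of the training approach the abstract describes, the sketch below combines a standard CLIP contrastive loss with a distillation term computed against teacher embeddings that were stored ahead of time, so no teacher model runs during training. Everything in it is an assumption made for illustration: the function names, the toy embeddings, the mixing weight `lam`, the temperature, and the KL-based form of the distillation term are not taken from this page. It also uses a single teacher rather than an ensemble and leaves out the synthetic captions.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def similarity_matrix(img, txt):
    """Cosine similarity between every image and every text embedding."""
    img = [normalize(v) for v in img]
    txt = [normalize(v) for v in txt]
    return [[sum(a * b for a, b in zip(i, t)) for t in txt] for i in img]

def softmax(row, temperature):
    """Numerically stable softmax of row / temperature."""
    scaled = [x / temperature for x in row]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def transpose(m):
    return [list(col) for col in zip(*m)]

def clip_loss(sim, temperature):
    """Symmetric cross-entropy; matching pairs sit on the diagonal."""
    def one_way(s):
        return -sum(math.log(softmax(row, temperature)[k])
                    for k, row in enumerate(s)) / len(s)
    return 0.5 * (one_way(sim) + one_way(transpose(sim)))

def distill_loss(teacher_sim, student_sim, temperature):
    """Symmetric KL(teacher || student) over the similarity rows (assumed form)."""
    def one_way(ts, ss):
        total = 0.0
        for trow, srow in zip(ts, ss):
            p = softmax(trow, temperature)
            q = softmax(srow, temperature)
            total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
        return total / len(ts)
    return 0.5 * (one_way(teacher_sim, student_sim)
                  + one_way(transpose(teacher_sim), transpose(student_sim)))

def reinforced_loss(student_img, student_txt, teacher_img, teacher_txt,
                    lam=0.7, temperature=0.07):
    """Mix the contrastive loss with distillation from stored teacher embeddings."""
    s = similarity_matrix(student_img, student_txt)
    t = similarity_matrix(teacher_img, teacher_txt)
    return (1 - lam) * clip_loss(s, temperature) + lam * distill_loss(t, s, temperature)

# Hypothetical stored teacher embeddings for a batch of three image-text pairs.
teacher_img = [[1.0, 0.0, 0.0, 0.1], [0.0, 1.0, 0.1, 0.0], [0.0, 0.1, 1.0, 0.0]]
teacher_txt = [[1.0, 0.0, 0.0, 0.1], [0.0, 1.0, 0.1, 0.0], [0.0, 0.1, 1.0, 0.0]]

# Hypothetical student embeddings for the same batch.
student_img = [[0.9, 0.1, 0.0, 0.2], [0.1, 0.8, 0.3, 0.0], [0.0, 0.2, 0.9, 0.1]]
student_txt = [[0.8, 0.2, 0.1, 0.1], [0.2, 0.9, 0.1, 0.0], [0.1, 0.1, 0.8, 0.3]]

loss = reinforced_loss(student_img, student_txt, teacher_img, teacher_txt)
print(f"reinforced loss on the toy batch: {loss:.4f}")
```

Because the teacher embeddings are read from storage, the only per-step cost beyond ordinary CLIP training in this sketch is the extra similarity matrix and the KL term, which is the kind of saving the abstract attributes to the reinforced dataset.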

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Language-Image Pre-training