MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, Oncel Tuzel

TL;DR
MobileCLIP introduces an efficient image-text model optimized for mobile deployment, utilizing multi-modal reinforced training to enhance accuracy and speed while reducing training overhead, achieving state-of-the-art latency-accuracy tradeoffs.
Contribution
The paper presents MobileCLIP, a new family of efficient image-text models, together with a training approach that transfers knowledge from an image captioning model and an ensemble of strong CLIP encoders to improve the accuracy of models small enough for mobile deployment.
Findings
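MobileCLIP-S2 is 2.3× faster and more accurate than the previous best CLIP model based on ViT-B/16.
A ViT-B/16-based CLIP model trained with the same approach gains +2.9% average performance on 38 evaluation benchmarks over the previous best.
Multi-modal reinforced training is 10×-1000× more learning-efficient than non-reinforced CLIP training.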
Abstract
Contrastive pretraining of image-text foundation models, such as CLIP, demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream tasks. However, these models utilize large transformer-based encoders with significant memory and latency overhead which pose challenges for deployment on mobile devices. In this work, we introduce MobileCLIP -- a new family of efficient image-text models optimized for runtime performance along with a novel and efficient training approach, namely multi-modal reinforced training. The proposed training approach leverages knowledge transfer from an image captioning model and an ensemble of strong CLIP encoders to improve the accuracy of efficient models. Our approach avoids train-time compute overhead by storing the additional knowledge in a reinforced dataset. MobileCLIP sets a new state-of-the-art latency-accuracy…
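The abstract describes knowledge transfer from an ensemble of CLIP encoders whose outputs are stored in a reinforced dataset, so no teacher needs to run at training time. A minimal sketch of how such stored teacher embeddings might enter a training loss is given below. It is an illustration under stated assumptions, not the authors' implementation: the function names, the weighting `lam`, the temperatures, and the use of a KL term between teacher and student similarity matrices are all hypothetical choices made for this example.

```python
import math

def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def similarity_matrix(image_embs, text_embs):
    # Entry [i][j] is the cosine similarity of image i and text j.
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def softmax(row, temperature):
    scaled = [x / temperature for x in row]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(m):
    return [list(col) for col in zip(*m)]

def clip_loss(sim, temperature):
    # Symmetric contrastive loss: the matching pair sits on the diagonal.
    def cross_entropy(matrix):
        return -sum(math.log(softmax(row, temperature)[i])
                    for i, row in enumerate(matrix)) / len(matrix)
    return 0.5 * (cross_entropy(sim) + cross_entropy(transpose(sim)))

def kl_rows(teacher_sim, student_sim, t_temp, s_temp):
    # Mean KL(teacher || student) over rows of the similarity matrices.
    total = 0.0
    for t_row, s_row in zip(teacher_sim, student_sim):
        p = softmax(t_row, t_temp)
        q = softmax(s_row, s_temp)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(teacher_sim)

def distill_loss(teacher_sim, student_sim, t_temp, s_temp):
    # Apply the KL term in both image-to-text and text-to-image directions.
    return 0.5 * (kl_rows(teacher_sim, student_sim, t_temp, s_temp)
                  + kl_rows(transpose(teacher_sim), transpose(student_sim),
                            t_temp, s_temp))

def reinforced_loss(student_img, student_txt,
                    stored_teacher_img, stored_teacher_txt,
                    lam=0.75, s_temp=0.07, t_temp=0.07):
    # Assumption: teacher embeddings were computed once and stored with the
    # dataset, so this step only reads them. lam and both temperatures are
    # hypothetical values chosen for illustration.
    s_sim = similarity_matrix(student_img, student_txt)
    t_sim = similarity_matrix(stored_teacher_img, stored_teacher_txt)
    return ((1 - lam) * clip_loss(s_sim, s_temp)
            + lam * distill_loss(t_sim, s_sim, t_temp, s_temp))
```

In this sketch the teacher appears only as stored embeddings, which is the point of the design: the cost of running the teacher models is paid once when the dataset is built rather than at every training step. With several teachers in an ensemble, one could average `distill_loss` over each teacher's stored embeddings, though that too is an assumption rather than something the abstract specifies.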
Code & Models
- 🤗 apple/MobileCLIP-S0 · model · 67 dl · ♡ 13
- 🤗 apple/MobileCLIP-B · model · 43 dl · ♡ 4
- 🤗 apple/MobileCLIP-B-LT · model · 17 dl · ♡ 10
- 🤗 apple/MobileCLIP-S1 · model · 63 dl · ♡ 11
- 🤗 apple/MobileCLIP-S2 · model · 151 dl · ♡ 11
- 🤗 apple/mobileclip_s0_timm · model · 53 dl · ♡ 12
- 🤗 apple/mobileclip_s1_timm · model · 45 dl · ♡ 3
- 🤗 apple/mobileclip_s2_timm · model · 51 dl · ♡ 6
- 🤗 apple/MobileCLIP-S1-OpenCLIP · model · 6.0k dl · ♡ 10
- 🤗 apple/MobileCLIP-S2-OpenCLIP · model · 55k dl · ♡ 18
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Language-Image Pre-training
