# MobileCLIP2: Improving Multi-Modal Reinforced Training

**Authors:** Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander Toshev, Oncel Tuzel, Hadi Pouransari

arXiv: 2508.20691 · 2025-08-29

## TL;DR

MobileCLIP2 improves multi-modal reinforced training to produce lightweight, low-latency image-text models with higher zero-shot accuracy. It does so with stronger CLIP teacher ensembles trained on the DFN dataset, captioner teachers fine-tuned on high-quality image-caption datasets, and synthetic captions combined from multiple caption generators.

## Contribution

The paper introduces MobileCLIP2, a new family of image-text models trained with improved multi-modal reinforced training, achieving state-of-the-art ImageNet-1k zero-shot accuracy at low latency and small model size.

## Key findings

- MobileCLIP2-B improves ImageNet-1k zero-shot accuracy by 2.2% over MobileCLIP-B.
- MobileCLIP2-S4 matches the ImageNet-1k zero-shot accuracy of SigLIP-SO400M/14 while being 2× smaller.
- MobileCLIP2-S4 improves on DFN ViT-L/14 at 2.5× lower latency.

## Abstract

Foundation image-text models such as CLIP with zero-shot capabilities enable a wide array of applications. MobileCLIP is a recent family of image-text models at 3-15ms latency and 50-150M parameters with state-of-the-art zero-shot accuracy. The main ingredients in MobileCLIP were its low-latency and light architectures and a novel multi-modal reinforced training that made knowledge distillation from multiple caption-generators and CLIP teachers efficient, scalable, and reproducible. In this paper, we improve the multi-modal reinforced training of MobileCLIP through: 1) better CLIP teacher ensembles trained on the DFN dataset, 2) improved captioner teachers trained on the DFN dataset and fine-tuned on a diverse selection of high-quality image-caption datasets. We discover new insights through ablations such as the importance of temperature tuning in contrastive knowledge distillation, the effectiveness of caption-generator fine-tuning for caption diversity, and the additive improvement from combining synthetic captions generated by multiple models. We train a new family of models called MobileCLIP2 and achieve state-of-the-art ImageNet-1k zero-shot accuracies at low latencies. In particular, we observe 2.2% improvement in ImageNet-1k accuracy for MobileCLIP2-B compared with MobileCLIP-B architecture. Notably, MobileCLIP2-S4 matches the zero-shot accuracy of SigLIP-SO400M/14 on ImageNet-1k while being 2$\times$ smaller and improves on DFN ViT-L/14 at 2.5$\times$ lower latency. We release our pretrained models (https://github.com/apple/ml-mobileclip) and the data generation code (https://github.com/apple/ml-mobileclip-dr). The data generation code makes it easy to create new reinforced datasets with arbitrary teachers using distributed scalable processing.
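The abstract highlights temperature tuning in contrastive knowledge distillation from a CLIP teacher ensemble. The following is a minimal sketch of one common formulation, not the authors' implementation: it assumes the distillation loss is a KL divergence between the teacher's and the student's softmax distributions over image-text similarities, averaged over both retrieval directions and over the teachers. All function names, the example similarity values, and the scale values (inverse temperatures) are hypothetical and chosen only for illustration.

```python
import math

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def scale(sim, logit_scale):
    # Multiply similarities by a logit scale (the inverse temperature).
    return [[logit_scale * x for x in row] for row in sim]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def kl_rows(teacher_logits, student_logits):
    # Mean over rows of KL(softmax(teacher) || softmax(student)).
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        lp, lq = log_softmax(t_row), log_softmax(s_row)
        total += sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))
    return total / len(teacher_logits)

def distill_loss(teacher_sims, student_sim, teacher_scale, student_scale):
    # teacher_sims: one image-text similarity matrix per teacher (assumed layout:
    # rows index images, columns index texts).
    s = scale(student_sim, student_scale)
    loss = 0.0
    for t_sim in teacher_sims:
        t = scale(t_sim, teacher_scale)
        image_to_text = kl_rows(t, s)
        text_to_image = kl_rows(transpose(t), transpose(s))
        loss += 0.5 * (image_to_text + text_to_image)
    return loss / len(teacher_sims)

# Illustrative 2x2 cosine-similarity matrices (made-up values).
teacher_sim = [[0.9, 0.1], [0.2, 0.8]]
student_sim = [[0.6, 0.4], [0.5, 0.5]]

# Sweep the teacher logit scale to see how the distillation target sharpens.
for teacher_scale in (10.0, 50.0, 100.0):
    loss = distill_loss([teacher_sim], student_sim, teacher_scale, 50.0)
    print(f"teacher_scale={teacher_scale:6.1f}  loss={loss:.4f}")
```

In this sketch, a larger teacher scale sharpens the teacher distribution toward a one-hot target, while a smaller one keeps more of the teacher's relative similarity structure. This is why the scale is worth tuning as a hyperparameter rather than leaving at a default.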

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20691/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20691/full.md

## References

82 references — full list in the complete paper: https://tomesphere.com/paper/2508.20691/full.md

---
Source: https://tomesphere.com/paper/2508.20691