CLIP-UP: A Simple and Efficient Mixture-of-Experts CLIP Training Recipe with Sparse Upcycling
Xinze Wang, Chen Chen, Yinfei Yang, Hong-You Chen, Bowen Zhang, Aditya Pal, Xiangxin Zhu, Xianzhi Du

TL;DR
CLIP-UP introduces an efficient method to convert dense CLIP models into sparse MoE models, significantly reducing training costs while improving performance on image-text retrieval benchmarks.
Contribution
We propose a novel sparse upcycling training strategy that transforms pre-trained dense CLIP models into efficient MoE models, enhancing performance and reducing training complexity.
Findings
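Sparse CLIP B/16 trained with CLIP-UP outperforms its dense counterpart by 7.2% on COCO text-to-image Recall@1
It improves on the dense counterpart by 6.6% on Flickr30k text-to-image Recall@1
It surpasses the larger CLIP L/14 on this task while using only 30% of the inference FLOPs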
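For readers unfamiliar with the metric behind these numbers, the following is a minimal sketch of how text-to-image Recall@K is commonly computed from embeddings. The function name `text_to_image_recall_at_k` and its arguments are hypothetical, and it assumes each caption has exactly one ground-truth image; it is not taken from the paper's evaluation code.

```python
import numpy as np

def text_to_image_recall_at_k(text_emb, image_emb, gt_image_idx, k=1):
    """Fraction of captions whose ground-truth image is among the top-k retrieved.

    text_emb:     (num_texts, dim) caption embeddings
    image_emb:    (num_images, dim) image embeddings
    gt_image_idx: (num_texts,) index of the correct image for each caption
    """
    # Normalize so the dot product is cosine similarity, as in CLIP-style retrieval.
    text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)

    sims = text_emb @ image_emb.T                  # (num_texts, num_images)
    topk = np.argsort(-sims, axis=-1)[:, :k]       # best k image indices per caption
    hits = (topk == np.asarray(gt_image_idx)[:, None]).any(axis=-1)
    return float(hits.mean())
```

With `k=1` this counts a caption as correct only when its top-ranked image is the ground-truth one.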
Abstract
Mixture-of-Experts (MoE) models are crucial for scaling model capacity while controlling inference costs. While integrating MoE into multimodal models like CLIP improves performance, training these models is notoriously challenging and expensive. We propose CLIP-Upcycling (CLIP-UP), an efficient alternative training strategy that converts a pre-trained dense CLIP model into a sparse MoE architecture. Through extensive experimentation with various settings and auxiliary losses, we demonstrate that CLIP-UP significantly reduces training complexity and cost. Remarkably, our sparse CLIP B/16 model, trained with CLIP-UP, outperforms its dense counterpart by 7.2% and 6.6% on COCO and Flickr30k text-to-image Recall@1 benchmarks respectively. It even surpasses the larger CLIP L/14 model on this task while using only 30% of the inference FLOPs. We further demonstrate the generalizability of our…
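The general idea of sparse upcycling can be illustrated with a small NumPy sketch: every expert in the new MoE layer starts as a copy of the pre-trained dense MLP, and only the router is freshly initialized. Everything here is an assumption for illustration, including the helper names `upcycle_mlp` and `moe_forward`, the two-matrix MLP, the ReLU activation, and the top-k softmax router; the paper's actual layer placement, routing scheme, and auxiliary losses are not shown.

```python
import numpy as np

def upcycle_mlp(dense_w1, dense_w2, num_experts, rng):
    """Create MoE parameters from one dense MLP (hypothetical helper)."""
    d_model = dense_w1.shape[0]
    # Each expert starts as an exact copy of the dense MLP weights.
    experts = [(dense_w1.copy(), dense_w2.copy()) for _ in range(num_experts)]
    # The router has no dense counterpart, so it is initialized randomly (assumption).
    router = rng.normal(scale=0.02, size=(d_model, num_experts))
    return experts, router

def moe_forward(x, experts, router, top_k=1):
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ router                                   # (tokens, experts)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)

    top = np.argsort(-probs, axis=-1)[:, :top_k]          # chosen expert ids
    gates = np.take_along_axis(probs, top, axis=-1)
    gates /= gates.sum(axis=-1, keepdims=True)            # renormalize over chosen experts

    out = np.zeros_like(x)
    for slot in range(top_k):
        for e, (w1, w2) in enumerate(experts):
            mask = top[:, slot] == e
            if mask.any():
                h = np.maximum(x[mask] @ w1, 0.0)         # ReLU stands in for the real activation
                out[mask] += gates[mask, slot:slot + 1] * (h @ w2)
    return out
```

Under these assumptions the upcycled layer reproduces the dense MLP's output exactly at initialization, because all experts are identical and the gate weights sum to one. Training then lets the experts diverge from this shared starting point instead of learning from scratch.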
Taxonomy
Topics: Video Analysis and Summarization · Speech Recognition and Synthesis · Handwritten Text Recognition Techniques
Methods: Mixture of Experts · Contrastive Language-Image Pre-training
