Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
Basil Mustafa, Carlos Riquelme, Joan Puigcerver, Rodolphe Jenatton, and Neil Houlsby

TL;DR
LIMoE is a sparse mixture-of-experts model that learns from images and text jointly. It achieves strong zero-shot image classification performance while addressing the training-stability and expert-utilization challenges that arise in the multimodal setting.
Contribution
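The paper introduces LIMoE, a multimodal sparse MoE model trained with a contrastive loss. It proposes an entropy-based regularization scheme to improve training stability and balance expert utilization, and reports performance improvements over dense models of equivalent computational cost across multiple scales.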
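To make the idea of an entropy-based routing regularizer concrete, here is a minimal sketch in plain Python. The summary only says that the scheme is entropy-based, so the specific form below is an assumption for illustration, not the paper's exact loss: it rewards low per-token routing entropy (each token picks experts confidently) and high entropy of the batch-averaged routing distribution (experts are used evenly). The function names `routing_entropies` and `entropy_regularizer` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's expert logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def routing_entropies(router_logits):
    """Return (local, global) entropies for a batch of tokens.

    router_logits: list of per-token lists, one logit per expert.
    local  = mean entropy of each token's routing distribution.
    global = entropy of the routing distribution averaged over tokens.
    """
    probs = [softmax(row) for row in router_logits]
    n_tokens = len(probs)
    n_experts = len(probs[0])
    local = sum(entropy(p) for p in probs) / n_tokens
    mean_probs = [sum(p[e] for p in probs) / n_tokens for e in range(n_experts)]
    return local, entropy(mean_probs)


def entropy_regularizer(router_logits):
    """Assumed form: minimize local entropy, maximize global entropy."""
    local, global_ = routing_entropies(router_logits)
    return local - global_
```

Under this assumed form, a batch whose tokens are confidently split across experts scores lower (better) than a batch in which every token collapses onto the same expert. In a multimodal model, such a term could be computed separately for image tokens and text tokens, though that choice is also an assumption here.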
Findings
LIMoE-L/16, trained comparably to CLIP-L/14, achieves 78.6% zero-shot ImageNet accuracy (vs. 76.2% for CLIP-L/14).
When scaled further, LIMoE reaches 84.1% zero-shot ImageNet accuracy, competitive with larger models.
Expert layers organically develop modality-specific specialization.
Abstract
Large sparsely-activated models have obtained excellent performance in multiple domains. However, such models are typically trained on a single modality at a time. We present the Language-Image MoE, LIMoE, a sparse mixture of experts model capable of multimodal learning. LIMoE accepts both images and text simultaneously, while being trained using a contrastive loss. MoEs are a natural fit for a multimodal backbone, since expert layers can learn an appropriate partitioning of modalities. However, new challenges arise; in particular, training stability and balanced expert utilization, for which we propose an entropy-based regularization scheme. Across multiple scales, we demonstrate remarkable performance improvement over dense models of equivalent computational cost. LIMoE-L/16 trained comparably to CLIP-L/14 achieves 78.6% zero-shot ImageNet accuracy (vs. 76.2%), and when further scaled…
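The abstract states that LIMoE is trained with a contrastive loss on image-text pairs. As a rough illustration of that kind of objective, the sketch below implements a generic symmetric contrastive (InfoNCE-style) loss in plain Python. It is not taken from the paper: the function name `contrastive_loss`, the fixed `temperature` value, and the use of cosine similarity are assumptions for this example.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_emb, text_emb, temperature=0.1):
    """Symmetric contrastive loss; pair k is (image_emb[k], text_emb[k]).

    temperature is a fixed illustrative value (an assumption); in practice
    it is often a learned parameter.
    """
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)
    # logits[i][j] = similarity of image i and text j, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text: each image should pick out its own caption.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: each caption should pick out its own image.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)
```

Calling `contrastive_loss` on embeddings whose matching pairs point in the same direction gives a loss near zero, while mismatched pairs give a much larger value, which is the signal that pulls paired images and texts together during training.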
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
