CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling
Jihai Zhang, Xiaoye Qu, Tong Zhu, Yu Cheng

TL;DR
This paper introduces CLIP-MoE, a mixture of experts framework that enhances CLIP's ability to encode diverse features through diversified fine-tuning and dynamic expert activation, improving performance in multimodal tasks.
Contribution
The paper proposes a novel Diversified Multiplet Upcycling framework to fine-tune pre-trained CLIP models into a mixture of experts, capturing diverse feature subspaces efficiently.
Findings
CLIP-MoE outperforms baseline models in zero-shot retrieval tasks.
It achieves superior accuracy in zero-shot image classification.
It also improves performance on downstream MLLM benchmarks.
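The zero-shot retrieval and classification findings above are typically measured by ranking candidates by cosine similarity between image and text embeddings. The following is a minimal sketch of that evaluation, not the paper's evaluation code: the toy embeddings, the function names, and the convention that image `i` is paired with text `i` are all assumptions made for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_scores(query, candidates):
    """Cosine similarity between one query and each candidate vector."""
    q = l2_normalize(query)
    return [sum(a * b for a, b in zip(q, l2_normalize(c))) for c in candidates]

def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class prompt most similar to the image."""
    scores = cosine_scores(image_emb, class_text_embs)
    return max(range(len(scores)), key=lambda i: scores[i])

def recall_at_k(image_embs, text_embs, k):
    """Fraction of images whose paired text (same index) ranks in the top k."""
    hits = 0
    for i, img in enumerate(image_embs):
        scores = cosine_scores(img, text_embs)
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        hits += i in ranked[:k]
    return hits / len(image_embs)

# Toy embeddings (hypothetical values, not produced by any real encoder).
image_embs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.6, 0.5, 0.0]]
text_embs = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0], [0.0, 0.1, 1.0]]

predicted = zero_shot_classify(image_embs[0], text_embs)
r1 = recall_at_k(image_embs, text_embs, k=1)
```

In this toy data the third image is deliberately closer to the wrong captions, so Recall@1 falls below 1.0. That is the kind of failure a more discriminative encoder is meant to reduce.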
Abstract
Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in multimodal intelligence. However, recent studies discovered that CLIP can only encode one aspect of the feature space, leading to substantial information loss and indistinctive features. To mitigate this issue, this paper introduces a novel strategy that fine-tunes a series of complementary CLIP models and transforms them into a CLIP-MoE. Specifically, we propose a model-agnostic Diversified Multiplet Upcycling (DMU) framework for CLIP. Instead of training multiple CLIP models from scratch, DMU leverages a pre-trained CLIP and fine-tunes it into a diverse set with highly cost-effective multistage contrastive learning, thus capturing distinct feature subspaces efficiently. To fully exploit these fine-tuned models while minimizing computational overhead, we transform them into a CLIP-MoE, which dynamically…
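The last step the abstract describes, combining the fine-tuned models into a mixture of experts with dynamic expert activation, can be illustrated with a generic top-k routing sketch. This is not the authors' implementation. The expert shape (a two-layer feed-forward block with ReLU), the random weights, the linear router, and `top_k=2` are assumptions chosen for illustration, and every name below is hypothetical.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def make_ffn(dim, hidden, rng):
    """Build a hypothetical two-layer FFN expert with random weights."""
    w1 = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(hidden)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(dim)]

    def ffn(x):
        h = [max(0.0, v) for v in matvec(w1, x)]  # ReLU (an assumption)
        return matvec(w2, h)

    return ffn

def moe_forward(x, experts, router, top_k=2):
    """Route one token to its top_k experts and mix their outputs."""
    logits = matvec(router, x)  # one routing logit per expert
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = softmax([logits[i] for i in top])  # renormalize over chosen experts
    out = [0.0] * len(x)
    for g, i in zip(gates, top):
        y = experts[i](x)  # only the selected experts are evaluated
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, top, gates

rng = random.Random(0)
dim, hidden, n_experts = 4, 8, 4
experts = [make_ffn(dim, hidden, rng) for _ in range(n_experts)]
router = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]

token = [0.5, -1.0, 0.25, 2.0]
output, chosen, gates = moe_forward(token, experts, router, top_k=2)
```

Because only the selected experts run for each token, compute per token grows with `top_k` rather than with the total number of experts. This is the usual motivation for sparse routing.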
Peer Reviews
Decision: Submitted to ICLR 2025
The experiments cover a wide range of tasks, including zero-shot retrieval, classification, and MLLM understanding. Compared against the original OpenAI CLIP large model, LongCLIP, and LLaVA-1.5, the proposed method shows superior performance. This work also shows that Diversified Multiplet Upcycling (DMU) is more efficient than simply upcycling the FFN to a MoE.
1. The zero-shot image classification results should be more diverse. Only including ImageNet, ImageNet-O, ImageNet-V2, CIFAR-10, and CIFAR-100 is not sufficient. Please refer to the CLIP benchmark [1]. I believe that including ImageNet, ImageNetV2, ImageNet-A, ImageNet-R, ImageNet-Sketch, and ObjectNet datasets is essential for a more comprehensive evaluation. 2. I'm interested in the performance on the MM-Vet benchmark. Reference: [1] https://github.com/LAION-AI/CLIP_benchmark
1. This work claims to be the first attempt to apply MoE to a CLIP-style model. I also believe it is an early attempt at incorporating MoE into contrastive learning. 2. The authors propose a novel method for initializing multiple experts in MoE, using MCL to first train and obtain diverse yet meaningful experts. 3. The proposed method is efficient, as it does not require retraining from scratch. 4. The paper is well-written, allowing me to quickly understand the authors’ key ideas.
1. The motivation behind this work—that CLIP often encodes inputs in a very coarse-grained manner—is neither novel nor particularly compelling. Additionally, the paper lacks an in-depth analysis of why this issue arises and does not clearly explain how the proposed method addresses it. The introduction could benefit from further refinement, as it currently seems to have chosen a weak motivation simply to justify applying MoE to CLIP. 2. The rationale for using Multistage Contrastive Learning (M
1. It introduces a new perspective on fine-tuning CLIP with Multistage Contrastive Learning (MCL) and validates this method through proper experiments. 2. The results show improvements across various downstream tasks, indicating its potential effectiveness in practical applications, supported by coherent writing and a relatively reasonable experimental setup.
1. Firstly, there is a mislabeling in the paper. In Section 5.5, Table 1 presents the results of th
Taxonomy
Topics: Data Mining Algorithms and Applications · Semantic Web and Ontologies · Mobile Crowdsensing and Crowdsourcing
Methods: Contrastive Language-Image Pre-training · Mixture of Experts
