Extracting Multimodal Learngene in CLIP: Unveiling the Multimodal Generalizable Knowledge
Ruiming Chen, Junming Yang, Shiyu Xia, Xu Yang, Jing Wang, Xin Geng

TL;DR
This paper introduces MM-LG, a framework that extracts and leverages multimodal generalizable knowledge from CLIP to initialize diverse models efficiently, reducing training costs and parameter storage while improving downstream task performance.
Contribution
The paper proposes MM-LG, a novel multimodal learngene extraction method that enhances model initialization, performance, and efficiency across various scales and modalities.
Findings
Achieves +3.1% on Oxford-IIIT PET and +4.13% on Flickr30k.
Cuts pre-training costs by a factor of approximately 2.8.
Requires only around 25% of the parameter storage of traditional methods.
Abstract
CLIP (Contrastive Language-Image Pre-training) has attracted widespread attention for its multimodal generalizable knowledge, which is significant for downstream tasks. However, the computational overhead of its large parameter count and large-scale pre-training makes it challenging to pre-train CLIP models at different scales. Learngene extracts generalizable components, termed learngene, from an ancestry model and uses them to initialize diverse descendant models. Previous Learngene paradigms fail to handle generalizable knowledge in multimodal scenarios. In this paper, we put forward the idea of utilizing a multimodal block to extract the multimodal generalizable knowledge, which inspires us to propose MM-LG (Multimodal Learngene), a novel framework designed to extract and leverage generalizable components from CLIP. Specifically, we first establish multimodal and unimodal blocks to…
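The general Learngene idea mentioned in the abstract (extract a small set of components from an ancestry model, then initialize descendants of different scales from it) can be illustrated with a minimal sketch. This is not the MM-LG procedure, whose details are truncated above. The stand-in blocks, the choice of `keep_indices`, and the cycling rule in `init_descendant` are all assumptions made purely for illustration.

```python
import copy
import random


def make_block(dim, rng):
    """Stand-in for a network block: one dim x dim weight matrix as nested lists."""
    return {"weight": [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]}


def extract_learngene(ancestor_blocks, keep_indices):
    """Copy the blocks assumed to hold generalizable knowledge.

    Which blocks to keep is a hypothetical choice here, not the paper's criterion.
    """
    return [copy.deepcopy(ancestor_blocks[i]) for i in keep_indices]


def init_descendant(learngene, depth):
    """Build a descendant with `depth` blocks by cycling through the learngene blocks.

    Cycling is an assumed expansion rule; each descendant gets independent copies.
    """
    return [copy.deepcopy(learngene[i % len(learngene)]) for i in range(depth)]


rng = random.Random(0)
ancestor = [make_block(4, rng) for _ in range(12)]            # 12-block ancestry model
learngene = extract_learngene(ancestor, keep_indices=[0, 5, 11])
descendants = {depth: init_descendant(learngene, depth) for depth in (3, 6, 9)}

for depth, blocks in descendants.items():
    print(f"descendant depth={depth}: {len(blocks)} blocks initialized from {len(learngene)} stored blocks")
```

In this toy setup only the three learngene blocks need to be stored to initialize all three descendants, which conveys the flavor of the storage argument without reproducing the reported figures.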
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
