Cloud-Device Collaborative Learning for Multimodal Large Language Models
Guanqun Wang, Jiaming Liu, Chenxuan Li, Junpeng Ma, Yuan Zhang, Xinyu Wei, Kevin Zhang, Maurice Chong, Ray Zhang, Yijiang Liu, Shanghang Zhang

TL;DR
This paper presents a cloud-device collaborative framework for improving the performance of compressed multimodal large language models on client devices by leveraging cloud-based knowledge and efficient data transmission strategies.
Contribution
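It introduces a framework with three components: Uncertainty-guided Token Sampling (UTS) for the device-to-cloud uplink, Adapter-based Knowledge Distillation (AKD) on the cloud, and Dynamic Weight update Compression (DWC) for the cloud-to-device downlink. Together they aim to improve model performance and transmission efficiency when deploying multimodal LLMs.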
Findings
Outperforms prior knowledge distillation methods.
Reduces transmission costs significantly.
Validates effectiveness through real-world experiments.
Abstract
The burgeoning field of Multimodal Large Language Models (MLLMs) has exhibited remarkable performance in diverse tasks such as captioning, commonsense reasoning, and visual scene understanding. However, the deployment of these large-scale MLLMs on client devices is hindered by their extensive model parameters, leading to a notable decline in generalization capabilities when these models are compressed for device deployment. Addressing this challenge, we introduce a Cloud-Device Collaborative Continual Adaptation framework, designed to enhance the performance of compressed, device-deployed MLLMs by leveraging the robust capabilities of cloud-based, larger-scale MLLMs. Our framework is structured into three key components: a device-to-cloud uplink for efficient data transmission, cloud-based knowledge adaptation, and an optimized cloud-to-device downlink for model deployment. In the…
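The three stages named in the abstract (uplink, cloud-side adaptation, downlink) can be pictured with a toy sketch. This is an illustration only, not the paper's method: entropy-based sample selection, a KL-divergence distillation loss, and top-k sparse weight deltas are stand-in techniques chosen here for clarity, and every function name below is hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_uplink(predictions, budget):
    """Device side (assumed strategy): return the indices of the `budget`
    most uncertain predictions, so only those are sent to the cloud."""
    ranked = sorted(
        range(len(predictions)),
        key=lambda i: entropy(predictions[i]),
        reverse=True,
    )
    return sorted(ranked[:budget])


def kd_loss(teacher_probs, student_probs):
    """Cloud side (assumed objective): KL(teacher || student), a standard
    distillation loss. Student probabilities must be strictly positive."""
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )


def sparse_delta(old_weights, new_weights, k):
    """Downlink (assumed strategy): keep only the k largest-magnitude
    weight changes, encoded as {index: delta}."""
    deltas = [n - o for o, n in zip(old_weights, new_weights)]
    top = sorted(range(len(deltas)), key=lambda i: abs(deltas[i]), reverse=True)[:k]
    return {i: deltas[i] for i in top}


def apply_delta(weights, delta):
    """Device side: patch local weights with the received sparse update."""
    return [w + delta.get(i, 0.0) for i, w in enumerate(weights)]
```

In this sketch the device would call `select_for_uplink` on its own predictions, the cloud would minimize `kd_loss` between the large and compressed models on the uploaded samples, and the device would call `apply_delta` on the result of `sparse_delta`. The point of the design is that both directions transmit a small subset (uncertain samples up, significant weight changes down) rather than full data or full models.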
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Subtitles and Audiovisual Media
Methods: Knowledge Distillation
