Pilot: Building the Federated Multimodal Instruction Tuning Framework
Baochen Xiong, Xiaoshan Yang, Yaguang Song, Yaowei Wang, and Changsheng Xu

TL;DR
This paper introduces Pilot, a federated multimodal instruction tuning framework that enables collaborative learning of multimodal models across distributed devices, effectively handling task heterogeneity and enhancing cross-task knowledge transfer.
Contribution
We propose a novel federated multimodal instruction tuning framework with an 'adapter on adapter' design and adaptive parameter aggregation, addressing task heterogeneity in distributed multimodal learning.
Findings
The framework captures both personalized, client-specific knowledge and general knowledge shared across tasks.
The adaptive parameter aggregation strategy improves parameter optimization.
The method performs well across different cross-task scenarios.
Abstract
In this paper, we explore a novel federated multimodal instruction tuning task (FedMIT), which is significant for collaboratively fine-tuning MLLMs on different types of multimodal instruction data on distributed devices. To solve the new task, we propose a federated multimodal instruction tuning framework (Pilot). Our framework integrates two stages of "adapter on adapter" into the connector of the vision encoder and the LLM. In stage 1, we extract task-specific features and client-specific features from visual information. In stage 2, we build the cross-task Mixture-of-Adapters (CT-MoA) module to perform cross-task interaction. Each client can not only capture personalized information of local data and learn task-related multimodal information, but also learn general knowledge from other tasks. In addition, we introduce an adaptive parameter aggregation strategy for text training…
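The abstract names a Mixture-of-Adapters module but does not spell out its internals here, so the following is only a minimal sketch of the general idea: several adapters process the same feature vector and a router blends their outputs. The bottleneck-with-residual adapter form, the softmax router, and every name in the code (Adapter, MixtureOfAdapters, gate) are assumptions made for illustration, not the authors' CT-MoA implementation.

```python
import math
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class Adapter:
    """Hypothetical bottleneck adapter: x + up(relu(down(x)))."""

    def __init__(self, dim, bottleneck, rng):
        self.down = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(bottleneck)]
        self.up = [[rng.gauss(0, 0.1) for _ in range(bottleneck)] for _ in range(dim)]

    def __call__(self, x):
        hidden = [max(0.0, v) for v in matvec(self.down, x)]
        delta = matvec(self.up, hidden)
        return [xi + di for xi, di in zip(x, delta)]


class MixtureOfAdapters:
    """Assumed design: a linear router produces softmax weights over adapters."""

    def __init__(self, dim, bottleneck, num_adapters, seed=0):
        rng = random.Random(seed)
        self.adapters = [Adapter(dim, bottleneck, rng) for _ in range(num_adapters)]
        self.router = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(num_adapters)]

    def gate(self, x):
        """Return one mixing weight per adapter; the weights sum to 1."""
        return softmax(matvec(self.router, x))

    def __call__(self, x):
        weights = self.gate(x)
        outputs = [adapter(x) for adapter in self.adapters]
        # Weighted sum of adapter outputs, computed per feature dimension.
        return [sum(w * out[i] for w, out in zip(weights, outputs)) for i in range(len(x))]


# Example: three adapters (e.g. one per task, an assumption) over a 4-dim feature.
moa = MixtureOfAdapters(dim=4, bottleneck=2, num_adapters=3)
feature = [0.5, -1.0, 0.25, 2.0]
mixed = moa(feature)
```

In a federated setting one could imagine each adapter holding parameters learned on a different task, so that the router lets a client draw on knowledge from other tasks; whether Pilot routes this way is not stated in the abstract.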
Taxonomy
Topics: Speech and dialogue systems
