Pilot: Building the Federated Multimodal Instruction Tuning Framework

Baochen Xiong, Xiaoshan Yang, Yaguang Song, Yaowei Wang, and Changsheng Xu

arXiv:2501.13985 · cs.LG · January 27, 2025

TL;DR

This paper introduces Pilot, a federated multimodal instruction tuning framework that enables collaborative learning of multimodal models across distributed devices, effectively handling task heterogeneity and enhancing cross-task knowledge transfer.

Contribution

We propose a novel federated multimodal instruction tuning framework with an 'adapter on adapter' design and adaptive parameter aggregation, addressing task heterogeneity in distributed multimodal learning.
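
The adaptive aggregation rule itself is not described on this page, so the following is only a minimal sketch of the generic server-side step that any such strategy builds on: a weighted average of client parameters. The function name `aggregate`, the dictionary layout of the parameters, and the weights are hypothetical. Passing local sample counts as weights gives the standard FedAvg average, which is used here as a stand-in and is not the paper's adaptive rule.

```python
def aggregate(client_params, client_weights):
    """Weighted average of per-client parameter dictionaries.

    client_params: list of dicts mapping a parameter name to a list of floats.
    client_weights: one non-negative weight per client (hypothetical; an
    adaptive strategy would compute these, FedAvg uses local sample counts).
    """
    if len(client_params) != len(client_weights):
        raise ValueError("need exactly one weight per client")
    total = sum(client_weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    # Normalize so the weights sum to 1.
    normalized = [w / total for w in client_weights]

    aggregated = {}
    for name, reference in client_params[0].items():
        aggregated[name] = [
            sum(w * params[name][i] for w, params in zip(normalized, client_params))
            for i in range(len(reference))
        ]
    return aggregated


# Two clients sharing one adapter parameter, weighted 1:3.
global_params = aggregate(
    [{"adapter": [0.0, 2.0]}, {"adapter": [4.0, 2.0]}],
    [1, 3],
)
```

Changing how `client_weights` is computed is the only part that would differ between a plain average and an adaptive scheme in this sketch.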

Findings

01. The framework captures both personalized (client-specific) and general (cross-task) knowledge.

02. Adaptive parameter aggregation improves parameter optimization.

03. The method performs well across different cross-task scenarios.

Abstract

In this paper, we explore a novel federated multimodal instruction tuning task (FedMIT), which is significant for collaboratively fine-tuning MLLMs on different types of multimodal instruction data on distributed devices. To solve the new task, we propose a federated multimodal instruction tuning framework (Pilot). Our framework integrates two stages of "adapter on adapter" into the connector of the vision encoder and the LLM. In stage 1, we extract task-specific features and client-specific features from visual information. In stage 2, we build the cross-task Mixture-of-Adapters (CT-MoA) module to perform cross-task interaction. Each client can not only capture personalized information of local data and learn task-related multimodal information, but also learn general knowledge from other tasks. In addition, we introduce an adaptive parameter aggregation strategy for text training…
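
To make the stage 2 idea concrete, here is a minimal sketch of a generic mixture-of-adapters layer: several adapters process the same feature vector and a gate mixes their outputs. It is not the paper's CT-MoA module. The class names, the single-linear-map adapters, the softmax gate, and the residual connection are all assumptions chosen for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LinearAdapter:
    """Hypothetical adapter: a single linear map (an assumption, kept simple)."""

    def __init__(self, dim, rng):
        self.weight = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]

    def __call__(self, x):
        return matvec(self.weight, x)


class MixtureOfAdapters:
    """Gate-weighted sum of adapter outputs, added back to the input."""

    def __init__(self, dim, num_adapters, seed=0):
        rng = random.Random(seed)
        self.adapters = [LinearAdapter(dim, rng) for _ in range(num_adapters)]
        # One gate row per adapter: the gate scores each adapter from the input.
        self.gate = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(num_adapters)]

    def gate_weights(self, x):
        return softmax(matvec(self.gate, x))

    def __call__(self, x):
        weights = self.gate_weights(x)
        outputs = [adapter(x) for adapter in self.adapters]
        mixed = [
            sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(x))
        ]
        # Residual connection (an assumption) keeps the original feature.
        return [xi + mi for xi, mi in zip(x, mixed)]


layer = MixtureOfAdapters(dim=4, num_adapters=3)
features = [1.0, 2.0, 3.0, 4.0]
mixed_features = layer(features)
```

In a federated setting, one could imagine each adapter in the mixture holding parameters learned on a different task, with the gate deciding how much each one contributes for a given input; that mapping is an illustration, not a claim about the paper's design.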

Taxonomy

Topics: Speech and dialogue systems