FedMLLM: Federated Fine-tuning MLLM on Multimodal Heterogeneity Data
Binqian Xu, Xiangbo Shu, Haiyang Mei, Guosen Xie, Basura Fernando, and Jinhui Tang

TL;DR
This paper introduces FedMLLM, a federated fine-tuning framework for multimodal large language models that addresses multimodal heterogeneity, providing a benchmark and demonstrating improved performance in privacy-sensitive, heterogeneous data scenarios.
Contribution
The paper presents a new benchmark for federated fine-tuning of MLLMs on heterogeneous multimodal data and proposes a general FedMLLM framework with modality-agnostic strategies.
Findings
Benchmark covers diverse multimodal heterogeneity scenarios.
FedMLLM improves MLLM performance across multiple datasets.
Framework effectively mitigates multimodal heterogeneity challenges.
Abstract
Multimodal Large Language Models (MLLMs) have made significant advancements, demonstrating powerful capabilities in processing and understanding multimodal data. Fine-tuning MLLMs with Federated Learning (FL) allows for expanding the training data scope by including private data sources, thereby enhancing their practical applicability in privacy-sensitive domains. However, current research remains in the early stage, particularly in addressing the multimodal heterogeneities in real-world applications. In this paper, we introduce a benchmark to evaluate the performance of federated fine-tuning of MLLMs across various multimodal heterogeneous scenarios, laying the groundwork for future research in the field. Our benchmark includes two lightweight MLLMs, two downstream tasks, three evaluation metrics, and five datasets across three domains, along with six comparison baselines,…
Taxonomy
Topics: Speech Recognition and Synthesis
