FedUMM: A General Framework for Federated Learning with Unified Multimodal Models
Zhaolong Su, Leheng Zhao, Xiaoying Wu, Ziyue Xu, Jindong Wang

TL;DR
FedUMM introduces a federated learning framework for unified multimodal models that maintains performance while significantly reducing communication costs, enabling privacy-preserving distributed training.
Contribution
This paper proposes FedUMM, a novel federated learning approach for UMMs using parameter-efficient fine-tuning with adapters, addressing privacy and communication challenges.
Findings
Achieves performance competitive with centralized training.
Reduces communication by over an order of magnitude.
Maintains robustness under data heterogeneity.
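The communication saving in the second finding follows from sending only low-rank adapter weights instead of full weight matrices. As a rough illustration (not a figure from the paper), a rank-r LoRA adapter on a d_out x d_in weight matrix adds r * (d_in + d_out) parameters. The layer size and rank below are hypothetical values chosen only to show the arithmetic; the actual saving depends on which layers are adapted and on the rank used.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the dense weight matrix the adapter is attached to."""
    return d_in * d_out


# Hypothetical layer size and rank, for illustration only (not from the paper).
D_IN = D_OUT = 4096
RANK = 16

adapter_params = lora_param_count(D_IN, D_OUT, RANK)
dense_params = full_param_count(D_IN, D_OUT)
ratio = dense_params / adapter_params

print(f"dense: {dense_params}, adapter: {adapter_params}, ratio: {ratio:.0f}x")
```

Under these assumed values, the per-layer payload shrinks by roughly two orders of magnitude, which is in line with the "over an order of magnitude" claim but is not a reproduction of the paper's measurement.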
Abstract
Unified multimodal models (UMMs) are emerging as strong foundation models that can do both generation and understanding tasks in a single architecture. However, they are typically trained in centralized settings where all training and downstream datasets are gathered in a central server, limiting the deployment in privacy-sensitive and geographically distributed scenarios. In this paper, we present FedUMM, a general federated learning framework for UMMs under non-IID multimodal data with low communication cost. Built on NVIDIA FLARE, FedUMM instantiates federation for a BLIP3o backbone via parameter-efficient fine-tuning: clients train lightweight LoRA adapters while freezing the foundation models, and the server aggregates only adapter updates. We evaluate on VQA v2 and the GenEval compositional generation benchmarks under Dirichlet-controlled heterogeneity with up to 16 clients.…
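The two mechanisms named in the abstract, Dirichlet-controlled heterogeneity and server-side aggregation of adapter updates, can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a label-based Dirichlet split and a sample-size-weighted (FedAvg-style) average, neither of which the excerpt confirms. The function names `dirichlet_partition` and `fedavg_adapters`, and the adapter key `lora_A`, are hypothetical, and the code does not use NVIDIA FLARE or BLIP3o.

```python
import random
from collections import defaultdict


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients, drawing per-class client
    proportions from Dirichlet(alpha). Smaller alpha gives a more skewed
    (more non-IID) split. Assumed scheme, not taken from the paper."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # A Dirichlet sample is a vector of Gamma draws, normalized to sum to 1.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(draws)
        start, cumulative = 0, 0.0
        for c in range(num_clients):
            cumulative += draws[c] / total
            # The last client takes the remainder so that no index is dropped.
            end = len(idxs) if c == num_clients - 1 else round(cumulative * len(idxs))
            clients[c].extend(idxs[start:end])
            start = end
    return clients


def fedavg_adapters(client_updates, client_sizes):
    """Sample-size-weighted average of adapter parameters only.
    Each update maps a parameter name to a flat list of floats."""
    total = sum(client_sizes)
    aggregated = {}
    for name in client_updates[0]:
        length = len(client_updates[0][name])
        aggregated[name] = [
            sum(u[name][i] * n for u, n in zip(client_updates, client_sizes)) / total
            for i in range(length)
        ]
    return aggregated


# Toy usage: 200 samples over 4 classes, split across 4 clients.
labels = [i % 4 for i in range(200)]
parts = dirichlet_partition(labels, num_clients=4, alpha=0.5)

# Two clients report the same adapter tensor; the frozen backbone is never sent.
updates = [{"lora_A": [1.0, 2.0]}, {"lora_A": [3.0, 4.0]}]
merged = fedavg_adapters(updates, client_sizes=[1, 3])
```

In this toy run, `merged["lora_A"]` is the weighted mean of the two client tensors, with the second client counted three times as heavily as the first. A real deployment would operate on tensors rather than Python lists and would route the exchange through the federation runtime.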
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Advanced Graph Neural Networks · Big Data and Digital Economy
