Multimodal Federated Learning via Contrastive Representation Ensemble
Qiying Yu, Yang Liu, Yimu Wang, Ke Xu, Jingjing Liu

TL;DR
This paper introduces CreamFL, a novel multimodal federated learning framework that enables heterogeneous client models to collaboratively learn richer representations without sharing raw data, improving performance on multimodal tasks.
Contribution
CreamFL allows training larger, heterogeneous models in federated settings using contrastive ensemble strategies, addressing modality and task gaps for better multimodal fusion.
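To make the phrase "contrastive" concrete, here is a minimal sketch of a generic InfoNCE-style contrastive loss between paired image and text embeddings. It is an illustration under stated assumptions, not CreamFL's actual objective: the function names, the temperature value, and the one-directional (image-to-text) form are all hypothetical choices made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Generic InfoNCE-style loss (illustrative, not the paper's exact loss).

    image_embs[i] and text_embs[i] are assumed to be a matching pair;
    every other text in the batch acts as a negative for image i.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(images):
        # Cosine similarity of image i to every text, scaled by temperature.
        logits = [sum(a * b for a, b in zip(img, txt)) / temperature
                  for txt in texts]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability assigned to the true pair.
        total += log_denom - logits[i]
    return total / len(images)


# Toy batch: aligned pairs should score a much lower loss than swapped pairs.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

A loss of this kind pulls matching image and text representations together and pushes mismatched ones apart, which is the general idea behind contrastive regularization of multimodal representations.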
Findings
CreamFL outperforms state-of-the-art FL methods on image-text retrieval.
Contrastive regularization improves the learned multimodal representations.
The approach is also effective on tasks such as visual question answering.
Abstract
With the increasing amount of multimedia data on modern mobile systems and IoT infrastructures, harnessing these rich multimodal data without breaching user privacy becomes a critical issue. Federated learning (FL) serves as a privacy-conscious alternative to centralized machine learning. However, existing FL methods extended to multimodal data all rely on model aggregation at the single-modality level, which restricts the server and clients to identical model architectures for each modality. This limits the global model in terms of both model complexity and data capacity, not to mention task diversity. In this work, we propose Contrastive Representation Ensemble and Aggregation for Multimodal FL (CreamFL), a multimodal federated learning framework that enables training larger server models from clients with heterogeneous model architectures and data modalities, while only communicating…
Taxonomy
Topics: Privacy-Preserving Technologies in Data
