pFedMMA: Personalized Federated Fine-Tuning with Multi-Modal Adapter for Vision-Language Models
Sajjad Ghiasvand, Mahnoosh Alizadeh, Ramtin Pedarsani

TL;DR
pFedMMA introduces a novel personalized federated learning framework using multi-modal adapters for vision-language models, effectively balancing personalization and global generalization across diverse datasets.
Contribution
It is the first to utilize multi-modal adapters in federated learning for vision-language tasks, enhancing personalization without sacrificing global generalization.
Findings
Achieves state-of-the-art trade-offs between personalization and generalization.
Outperforms recent federated prompt tuning methods across eleven datasets.
Effective in domain- and label-shift scenarios.
Abstract
Vision-Language Models (VLMs) like CLIP have demonstrated remarkable generalization in zero- and few-shot settings, but adapting them efficiently to decentralized, heterogeneous data remains a challenge. While prompt tuning has emerged as a popular parameter-efficient approach in personalized federated learning, existing methods often sacrifice generalization in favor of personalization, struggling particularly on unseen classes or domains. In this work, we propose pFedMMA, the first personalized federated learning framework that leverages multi-modal adapters for vision-language tasks. Each adapter contains modality-specific up- and down-projection layers alongside a globally shared projection that aligns cross-modal features. Our optimization strategy allows clients to locally adapt to personalized data distributions while collaboratively training the shared projection to improve…
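The adapter structure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `MultiModalAdapter`, the ReLU activations, the residual `scale`, the initialization, and all dimensions are assumptions chosen for readability. Only the overall layout (modality-specific down- and up-projections around one shared projection) comes from the abstract.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(x):
    return [max(0.0, v) for v in x]


def init_matrix(rows, cols, rng, std=0.02):
    """Small random Gaussian initialization (an assumption, not from the paper)."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


class MultiModalAdapter:
    """Hypothetical adapter: per-modality down/up projections plus one shared projection."""

    def __init__(self, dim_img, dim_txt, dim_shared, seed=0):
        rng = random.Random(seed)
        # Modality-specific layers: these would stay on the client.
        self.down = {
            "image": init_matrix(dim_shared, dim_img, rng),
            "text": init_matrix(dim_shared, dim_txt, rng),
        }
        self.up = {
            "image": init_matrix(dim_img, dim_shared, rng),
            "text": init_matrix(dim_txt, dim_shared, rng),
        }
        # Shared projection: the same weights are used by both modalities.
        self.shared = init_matrix(dim_shared, dim_shared, rng)

    def forward(self, x, modality, scale=0.1):
        h = relu(matvec(self.down[modality], x))   # into the shared space
        h = relu(matvec(self.shared, h))           # cross-modal shared projection
        delta = matvec(self.up[modality], h)       # back to the modality's width
        # Residual connection around the adapter (assumed).
        return [xi + scale * di for xi, di in zip(x, delta)]
```

For example, `MultiModalAdapter(8, 6, 4).forward([1.0] * 8, "image")` returns a vector of length 8, and the `"text"` path returns one of length 6, while both paths pass through the same `shared` weights.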
Peer Reviews
Decision: ICLR 2026 Poster
(S1) The paper is very well-organized and well-written. It is delightfully easy to read. Intuitions from related work are provided in several places for proper contextualization. The problem, algorithm design, and experimental setups are all well motivated. Results and intuition are well communicated.
(S2) Experiments and metrics presented provide good quality empirical evidence to support the claims in the paper. A diversity of datasets and experimental conditions covering data heterogeneity,
(W1) While related work is mostly well cited, I believe that one relevant paper [FedDAT] is missing. Contributions of this manuscript should be contextualized and differentiated w.r.t. this reference. [FedDAT] Chen, H., Zhang, Y., Krompass, D., Gu, J., & Tresp, V. (2024). FedDAT: An Approach for Foundation Model Finetuning in Multi-Modal Heterogeneous Federated Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 38(10), 11285-11293.
(W2) Lines 461-462: Based on Fig. 4, one
1. pFedMMA effectively introduces multi-modal adapters into personalized federated learning, balancing personalization and generalization. It addresses the poor generalization of existing prompt-tuning methods on unseen classes.
2. The asymmetric training mechanism, which aggregates only the shared projection layer, reduces communication costs while retaining modality-specific up- and down-projections locally to adapt to local data distributions.
3. Through extensive evaluation across diverse
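The asymmetric scheme in point 2 can be illustrated with a short sketch. It assumes a sample-size-weighted average in the style of FedAvg for the shared projection; the excerpt does not state the exact aggregation rule, and the function names and the dictionary layout of a client state are hypothetical.

```python
def aggregate_shared(client_states, client_sizes):
    """Average only the 'shared' matrix across clients, weighted by local sample count.

    Each client state is a dict holding a 'shared' matrix (list of rows) plus
    any number of local-only entries, which this function never reads.
    """
    total = sum(client_sizes)
    rows = len(client_states[0]["shared"])
    cols = len(client_states[0]["shared"][0])
    return [
        [
            sum(n * s["shared"][i][j] for s, n in zip(client_states, client_sizes)) / total
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def broadcast_shared(client_states, new_shared):
    """Overwrite each client's shared projection; local-only layers are left untouched."""
    for state in client_states:
        state["shared"] = [row[:] for row in new_shared]
```

Only the shared matrix crosses the network in each round, which is the source of the communication saving the reviewer mentions; the modality-specific layers never leave the client.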
1. Although communication cost is reduced, the total number of trainable parameters introduced by pFedMMA is significantly larger than that of mainstream prompt-tuning methods, increasing local computational and memory burdens, which may make it unsuitable for resource-constrained devices.
2. Despite achieving the best harmonic mean (HM) performance, pFedMMA shows noticeably lower local accuracy than pFedMoAP on several datasets (e.g., Flowers102 and DTD), indicating that its personalization capability is
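To see how a method can lead on HM while trailing on local accuracy, a small sketch of the harmonic mean helps. Which accuracies enter the HM is not stated in this excerpt; the sketch assumes it combines a local (personalized) accuracy with an accuracy on unseen classes, and the numbers in the usage note are made up for illustration.

```python
def harmonic_mean_accuracy(accuracies):
    """Harmonic mean of a list of strictly positive accuracies."""
    if not accuracies or any(a <= 0 for a in accuracies):
        raise ValueError("accuracies must be a non-empty list of positive numbers")
    return len(accuracies) / sum(1.0 / a for a in accuracies)
```

With hypothetical scores, `harmonic_mean_accuracy([90.0, 60.0])` is about 72, while `harmonic_mean_accuracy([80.0, 75.0])` is about 77.4: the second method has lower local accuracy but a higher HM because the harmonic mean is pulled toward the weaker of the two scores.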
Comprehensive Evaluation: The study spans diverse heterogeneity scenarios (label shifts via non-overlapping classes, feature shifts via multi-domain datasets like DomainNet), using Dirichlet partitioning for realistic non-IID data. Testing on a wide range of datasets, two backbones (ViT-B/16, ViT-B/32), and few-shot regimes provides robust evidence of applicability and interpretability.
Efficiency and Scalability: As a parameter-efficient fine-tuning (PEFT) method, it freezes the VLM backbone, training
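Dirichlet partitioning, mentioned in the review above, is commonly implemented by drawing per-class client proportions from a Dirichlet distribution. The sketch below shows one such variant using only the standard library; the function name, the concentration value `alpha`, and the rounding scheme are assumptions, and the paper's exact protocol may differ.

```python
import random


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with per-class Dirichlet(alpha) proportions.

    Smaller alpha gives more skewed (more non-IID) client label distributions.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # A Dirichlet sample is a vector of Gamma(alpha, 1) draws normalized to sum to 1.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(draws)
        # Turn cumulative proportions into cut points over this class's indices.
        cuts, acc = [], 0.0
        for d in draws[:-1]:
            acc += d / total
            cuts.append(int(round(acc * len(idxs))))
        start = 0
        for client_id, end in enumerate(cuts + [len(idxs)]):
            clients[client_id].extend(idxs[start:end])
            start = end
    return clients
```

Calling `dirichlet_partition(labels, num_clients=4, alpha=0.5)` assigns every index to exactly one client; with a small `alpha`, most samples of a given class tend to land on a few clients.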
Unverified Cross-Modal Alignment: The core claim of achieving cross-modal consistency via the shared projection lacks rigorous validation. The parallel adapter design with a shared layer assumes modality interaction without explicit mechanisms (e.g., attention or fusion gates), and no quantitative evidence (e.g., cosine similarity, t-SNE visualizations) confirms reduced modality gaps or alignment under federated heterogeneity.
Insufficient Motivation and Problem Framing: The motivation relies
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Graph Neural Networks
Methods: Adapter · Contrastive Language-Image Pre-training
