FedVLMBench: Benchmarking Federated Fine-Tuning of Vision-Language Models
Weiying Zheng, Ziyue Lin, Pengxin Guo, Yuyin Zhou, Feifei Wang, Liangqiong Qu

TL;DR
FedVLMBench is a comprehensive benchmark that evaluates federated fine-tuning strategies for vision-language models, addressing privacy concerns and providing insights into architecture and data heterogeneity effects.
Contribution
This work introduces the first systematic benchmark for federated fine-tuning of VLMs, covering multiple architectures, strategies, datasets, and tasks, with extensive experimental analysis.
Findings
A 2-layer MLP connector with concurrent tuning is optimal for encoder-based VLMs in FL.
FL methods are more sensitive to data heterogeneity in vision tasks than in text tasks.
The benchmark offers standardized tools and datasets for advancing privacy-preserving multimodal models.
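As a rough illustration of what federated fine-tuning of a small connector involves on the server side, the sketch below averages client parameters weighted by local sample count (FedAvg-style aggregation). This summary does not say which aggregation rules the benchmark uses, so FedAvg is an assumption here. The parameter names (`connector.fc1`, `connector.fc2`) and the toy values are hypothetical.

```python
from typing import Dict, List

# Each client's trainable parameters as name -> flattened list of floats.
Params = Dict[str, List[float]]


def fedavg(client_params: List[Params], client_sizes: List[int]) -> Params:
    """Average client parameters, weighting each client by its sample count."""
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    averaged: Params = {}
    for name in client_params[0]:
        length = len(client_params[0][name])
        averaged[name] = [
            sum(p[name][i] * n for p, n in zip(client_params, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


# Hypothetical connector parameters from two clients (assumed, not from the paper).
clients = [
    {"connector.fc1": [1.0, 2.0], "connector.fc2": [0.0]},
    {"connector.fc1": [3.0, 4.0], "connector.fc2": [4.0]},
]
sizes = [1, 3]  # local sample counts; the second client has three times the data
global_params = fedavg(clients, sizes)
```

In this toy setup the second client dominates the average because it holds more samples. This is one way label or task skew across clients can pull the shared connector toward a single site's data.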
Abstract
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in cross-modal understanding and generation by integrating visual and textual information. While instruction tuning and parameter-efficient fine-tuning methods have substantially improved the generalization of VLMs, most existing approaches rely on centralized training, posing challenges for deployment in domains with strict privacy requirements like healthcare. Recent efforts have introduced Federated Learning (FL) into VLM fine-tuning to address these privacy concerns, yet comprehensive benchmarks for evaluating federated fine-tuning strategies, model architectures, and task generalization remain lacking. In this work, we present FedVLMBench, the first systematic benchmark for federated fine-tuning of VLMs. FedVLMBench integrates two mainstream VLM architectures (encoder-based and encoder-free), four…
Peer Reviews
Decision · Submitted to ICLR 2026
- The experiments are comprehensive, covering comparisons across model architectures, fine-tuning methods, and heterogeneity levels.
- The paper is well-organized and clearly written, with strong visual support through well-designed figures and tables.
- While incorporating both encoder-based and encoder-free architectures enhances the comprehensiveness of the benchmark, this inclusion appears to be driven by empirical considerations rather than theoretical motivation. The introduction briefly states that existing FL studies mainly focus on encoder-based VLMs, and that encoder-free architectures have recently emerged, but it does not sufficiently justify why this comparison is fundamentally necessary in the federated context. The authors do no
This benchmark is meaningful as it incorporates a wider variety of VLM architectures, a more diverse set of tasks, and different fine-tuning strategies for VLMs under the Federated Learning (FL) framework. Such a design enhances the comprehensiveness and practical value of the evaluation, providing valuable insights and contributions to the research community.
Although the benchmark is meaningful, it still has several shortcomings: 1. Experimental aspect: It only presents the performance of different VLMs across various tasks and fine-tuning strategies. However, the experimental results, especially under different modes (F-C, F-L, F-CL, F-2stage), show limited differences, and it is unclear whether testing such combinations of fine-tuning strategies is truly necessary for FL; 2. Technical aspect: The work lacks technical innovation. It merely evaluat
Strength: 1. The paper considered a wide range of downstream tasks, 2 main VLM architectures, 4 mainstream finetuning strategies, and multiple datasets across different domains, which makes the benchmark more comprehensive. 2. While current VLMs are strong on natural image and language tasks, they still underperform on domain-specific tasks like medical imaging. Adapting VLMs faces two real issues: limited data and strict privacy. This paper gives a practical, comprehensive guideline to persona
Weakness & question: 1. While the considered VLM approaches are very up to date, most compared baselines are from 5 years ago. Is there any reason for choosing these methods? 2. While the 5 takeaways seem very valuable and reasonable, I'm interested in takeaway 1 - why should 2-layer be better than 6-layer? Do you have any insight or analysis experiments to further explore this phenomenon? 3. In domains that have privacy concerns, like healthcare, the fairness problem is very pronounced. Whil
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
