FlowerTune: A Cross-Domain Benchmark for Federated Fine-Tuning of Large Language Models

Yan Gao; Massimo Roberto Scamarcia; Javier Fernandez-Marques; Mohammad Naseri; Chong Shen Ng; Dimitris Stripelis; Zexi Li; Tao Shen; Jiamu Bai; Daoyuan Chen; Zikai Zhang; Rui Hu; InSeo Song; Lee KangYoon; Hong Jia; Ting Dang; Junyan Wang; Zheyuan Liu; Daniel Janes Beutel; Lingjuan Lyu; Nicholas D. Lane

arXiv:2506.02961·cs.CL·December 1, 2025

TL;DR

This paper introduces FlowerTune, a comprehensive benchmark suite for evaluating federated fine-tuning of large language models across multiple domains, addressing data privacy and domain adaptation challenges.

Contribution

It presents the first benchmarking suite for federated fine-tuning of LLMs across diverse domains, including datasets, evaluation metrics, and a comparative analysis of 26 models.

Findings

1. Federated fine-tuning performance varies across models and domains.
2. Resource constraints significantly impact federated LLM training.
3. Domain-specific adaptation improves model effectiveness.

Abstract

Large Language Models (LLMs) have achieved state-of-the-art results across diverse domains, yet their development remains reliant on vast amounts of publicly available data, raising concerns about data scarcity and the lack of access to domain-specific, sensitive information. Federated Learning (FL) presents a compelling framework to address these challenges by enabling decentralized fine-tuning of pre-trained LLMs without sharing raw data. However, the compatibility and performance of pre-trained LLMs in FL settings remain largely underexplored. We introduce the FlowerTune LLM Leaderboard, a first-of-its-kind benchmarking suite designed to evaluate federated fine-tuning of LLMs across four diverse domains: general NLP, finance, medical, and coding. Each domain includes federated instruction-tuning datasets and domain-specific evaluation metrics. Our results, obtained through a…

Taxonomy

Topics: Privacy-Preserving Technologies in Data