Federated Fine-tuning of SAM-Med3D for MRI-based Dementia Classification

Kaouther Mouheb; Marawan Elbatel; Janne Papma; Geert Jan Biessels; Jurgen Claassen; Huub Middelkoop; Barbara van Munster; Wiesje van der Flier; Inez Ramakers; Stefan Klein; and Esther E. Bron

arXiv:2508.21458·cs.CV·October 16, 2025


TL;DR

This study evaluates how different design choices affect federated fine-tuning of foundation models for MRI-based dementia classification, providing insights for practical deployment in decentralized clinical environments.

Contribution

It systematically assesses the impact of classification head design, fine-tuning strategies, and aggregation methods on federated foundation model performance using brain MRI data.

Findings

1. Classification head architecture significantly affects performance.

2. Freezing the FM encoder yields results comparable to full fine-tuning.

3. Advanced aggregation methods outperform standard federated averaging.

Abstract

While foundation models (FMs) offer strong potential for AI-based dementia diagnosis, their integration into federated learning (FL) systems remains underexplored. In this benchmarking study, we systematically evaluate the impact of key design choices: classification head architecture, fine-tuning strategy, and aggregation method, on the performance and efficiency of federated FM tuning using brain MRI data. Using a large multi-cohort dataset, we find that the architecture of the classification head substantially influences performance, freezing the FM encoder achieves comparable results to full fine-tuning, and advanced aggregation methods outperform standard federated averaging. Our results offer practical insights for deploying FMs in decentralized clinical settings and highlight trade-offs that should guide future method development.

Tables

Table 2: Efficiency metrics for experiments on classification head (CLS) architecture and fine-tuning technique. Efficiency for CLS is assessed with the CLS-only fine-tuning setting, efficiency for fine-tuning techniques with the “CONV S” classification head.

Experiment	Trainable Params (p)	Message Size (kB)	Latency (ms)	GPU Mem (GB)	Energy (MJ)	FLOPs (G)
ResNet18	33M	232	6.1	44	2.5	251
NCC	0	0.5	1.2	14	0.2	184
Linear	770	0.5	1.3	14	0.4	184
CONV S	1.7M	16	1.9	14	2.9	185
CONV L	4.2M	36	3.0	14	2.9	186
All	92M + 1.7M	424	6.9	44	4.1	185
LoRA All	294k + 1.7M	19	2.0	40	4.0	186
LoRA First 6	147k + 1.7M	18	2.0	40	4.0	185
LoRA Last 6	147k + 1.7M	18	2.0	26	3.8	185

Table 3: AUC scores per client across all experiments, reported as average [95% CI] with bootstrapping on the test set. The highest average AUC for each client and experiment is highlighted in bold. RML: Rate-My-LoRA.

Method	ADNI	BrainLAT	NACC	NIFD	OASIS	PND	All
Baselines
ResNet18	0.93 [0.90, 0.96]	0.82 [0.74, 0.88]	0.88 [0.86, 0.91]	0.79 [0.69, 0.88]	0.88 [0.76, 0.94]	0.83 [0.73, 0.90]	0.86 [0.84, 0.88]
NCC	0.74 [0.69, 0.79]	0.72 [0.63, 0.80]	0.72 [0.68, 0.75]	0.74 [0.62, 0.83]	0.79 [0.64, 0.89]	0.74 [0.63, 0.83]	0.71 [0.69, 0.73]
Centralized	0.91 [0.87, 0.94]	0.84 [0.77, 0.90]	0.89 [0.86, 0.91]	0.84 [0.74, 0.91]	0.86 [0.63, 0.94]	0.83 [0.74, 0.90]	0.87 [0.86, 0.89]
Classification Head Architecture
Linear	0.81 [0.76, 0.85]	0.74 [0.65, 0.82]	0.79 [0.76, 0.82]	0.84 [0.74, 0.91]	0.82 [0.62, 0.93]	0.73 [0.62, 0.82]	0.76 [0.74, 0.78]
CONV S	0.91 [0.87, 0.94]	0.81 [0.73, 0.87]	0.89 [0.86, 0.91]	0.84 [0.75, 0.91]	0.83 [0.63, 0.93]	0.89 [0.81, 0.94]	0.86 [0.84, 0.87]
CONV L	0.90 [0.86, 0.93]	0.83 [0.75, 0.89]	0.90 [0.88, 0.92]	0.87 [0.78, 0.93]	0.86 [0.69, 0.95]	0.86 [0.77, 0.92]	0.86 [0.85, 0.88]
Fine-tuning Method
All	0.91 [0.87, 0.93]	0.82 [0.73, 0.88]	0.88 [0.86, 0.90]	0.92 [0.84, 0.96]	0.80 [0.59, 0.91]	0.83 [0.73, 0.90]	0.86 [0.84, 0.88]
CLS Only	0.91 [0.87, 0.94]	0.81 [0.73, 0.87]	0.89 [0.86, 0.91]	0.84 [0.75, 0.91]	0.83 [0.63, 0.93]	0.89 [0.81, 0.94]	0.86 [0.84, 0.87]
LoRA All	0.90 [0.86, 0.93]	0.82 [0.74, 0.87]	0.89 [0.87, 0.91]	0.84 [0.75, 0.91]	0.87 [0.71, 0.94]	0.87 [0.77, 0.93]	0.86 [0.84, 0.87]
LoRA first 6	0.91 [0.87, 0.93]	0.81 [0.72, 0.87]	0.90 [0.88, 0.92]	0.83 [0.73, 0.90]	0.85 [0.67, 0.94]	0.83 [0.73, 0.90]	0.86 [0.84, 0.88]
LoRA last 6	0.89 [0.85, 0.92]	0.79 [0.70, 0.85]	0.89 [0.87, 0.91]	0.86 [0.77, 0.92]	0.80 [0.57, 0.91]	0.83 [0.73, 0.91]	0.85 [0.83, 0.86]
Aggregation Technique
SimpleAvg	0.86 [0.81, 0.89]	0.80 [0.72, 0.87]	0.85 [0.82, 0.87]	0.90 [0.82, 0.95]	0.86 [0.71, 0.93]	0.87 [0.79, 0.93]	0.84 [0.82, 0.85]
FedAvg	0.91 [0.87, 0.94]	0.81 [0.73, 0.87]	0.89 [0.86, 0.91]	0.84 [0.75, 0.91]	0.83 [0.63, 0.93]	0.89 [0.81, 0.94]	0.86 [0.84, 0.87]
FedCE	0.92 [0.88, 0.94]	0.80 [0.72, 0.86]	0.90 [0.88, 0.92]	0.89 [0.81, 0.94]	0.86 [0.63, 0.96]	0.92 [0.85, 0.96]	0.87 [0.86, 0.89]
RML	0.92 [0.88, 0.94]	0.82 [0.74, 0.88]	0.91 [0.88, 0.93]	0.84 [0.74, 0.91]	0.86 [0.59, 0.95]	0.88 [0.79, 0.94]	0.87 [0.86, 0.89]

Equations

$$\mathbf{W}_{\text{LoRA}}=\mathbf{W}+\Delta\mathbf{W}=\mathbf{W}+\mathbf{A}\mathbf{B},\quad \mathbf{A}\in\mathbb{R}^{d\times r},\ \mathbf{B}\in\mathbb{R}^{r\times k},\ r\ll\min(d,k)$$


Full text

1 Dept. of Radiology & Nuclear Medicine, Erasmus MC, Rotterdam, the Netherlands
2 The Hong Kong University of Science and Technology, Hong Kong SAR
3 Dept. of Neurology, Erasmus MC, Rotterdam, the Netherlands
4 Dept. of Neurology, UMC Utrecht, Utrecht, the Netherlands
5 Dept. of Geriatrics, Radboud UMC, Nijmegen, the Netherlands
6 Dept. of Neurology, Leiden UMC, Leiden, the Netherlands
7 Dept. of Internal Medicine, UMC Groningen, Groningen, the Netherlands
8 Dept. of Neurology, Amsterdam UMC location VUmc, Amsterdam, the Netherlands
9 Dept. of Psychiatry & Psychology, Maastricht UMC, Maastricht, the Netherlands


Kaouther Mouheb Corresponding author: [email protected]

Marawan Elbatel 2

Janne Papma 3

Geert Jan Biessels 4

Jurgen Claassen 5

Huub Middelkoop 6

Barbara van Munster 7

Wiesje van der Flier 8

Inez Ramakers 9

Stefan Klein 1

Esther E. Bron 1


Keywords: Federated learning · Foundation models · Dementia · MRI

1 Introduction

The accurate and early diagnosis of dementia is crucial for effective intervention and care [3]. Training AI models for this task requires diverse, multi-center datasets to capture patient variability. However, centralizing such data raises significant privacy concerns [22]. Federated learning (FL) addresses this challenge by allowing collaborative training while preserving data privacy [17, 14], but it faces challenges of its own, such as inter-client heterogeneity, which can hinder model convergence and performance [4]. Foundation models (FMs) are large-scale models pre-trained on extensive datasets, capable of generalizing across diverse tasks. Integrating FMs into FL offers a promising approach to improve performance, as they act as powerful feature extractors, allowing efficient transfer learning for downstream tasks [7, 5]. Fine-tuning FMs in federated settings involves critical design decisions, including the selection of the classification head, fine-tuning strategy, and aggregation method. While prior research has explored some of these aspects in medical image segmentation [18] and 2D classification [2], their impact on 3D medical image classification remains unexplored. This gap underscores the need for systematic investigation to optimize federated FM fine-tuning in this domain.

Transfer learning from FMs has shown strong promise in medical imaging. Baharoon et al. found that DINOv2, a general-purpose 2D FM, shows high performance in various medical tasks, including MRI-based classification [5, 23]. Wang et al. built SAM-Med3D, an FM trained fully on 3D medical images using a multimodal dataset of 140,000 scans [29]. In dementia research, Xue et al. built a multimodal FM for dementia diagnosis using public datasets [30], building on the SwinUNETR model [28]. The intersection of FL and FMs in medical tasks is gaining attention [16]. For instance, Rate-My-LoRA was proposed to fine-tune FMs for cardiac MRI segmentation [12], demonstrating the potential of FL to enhance 3D medical FMs.

Comparing these studies highlights key design choices that can impact model performance and efficiency. For example, Xue et al. integrated a convolutional adapter on top of SwinUNETR, while Baharoon et al. used a simple linear layer as a classification head [30, 5]. Fine-tuning methods vary, with some freezing the FM backbone [30], while others use parameter-efficient fine-tuning (PEFT) methods such as low-rank adaptation (LoRA) to reduce communication overhead [13, 12]. Aggregation methods also differ, from traditional algorithms such as FedAvg [20] to more advanced methods such as Rate-My-LoRA [12]. Existing work on federated FMs mainly focuses on 2D modalities and segmentation tasks. Moreover, many studies use simulated federations due to limited multi-center datasets, raising concerns about their clinical relevance and viability in medical practice. Thus, the impact of the different design choices on the performance and efficiency of federated FM fine-tuning for dementia diagnosis is poorly understood, highlighting the need for rigorous and systematic evaluation on real-world multi-center datasets.

In this work, we present a comprehensive empirical study on federated fine-tuning of a 3D FM (SAM-Med3D) for MRI-based dementia diagnosis. Our key contributions are: (i) We develop an open-source framework for evaluating federated fine-tuning of 3D FMs in medical image classification. (ii) We conduct a systematic analysis of three key design factors: classification head architecture, fine-tuning strategy, and federated aggregation technique, demonstrating their impact on diagnostic performance and efficiency. (iii) We benchmark the methods on a large dataset of 6,076 samples from multiple cohorts of diverse sources, offering actionable insights into deploying federated FMs in clinical settings.

2 Materials and Method

2.1 Problem Formulation

We aim to fine-tune an FM for a classification task using FL. The training data is distributed across $N$ clients, with the $i$-th client holding a local dataset $\mathcal{D}_{i}=\{(\mathbf{x}_{i,j},y_{i,j})\}_{j=1}^{n_{i}}$, where $\mathbf{x}_{i,j}$ represents the input 3D scan and $y_{i,j}\in\{1,2,\ldots,C\}$ is the corresponding class label. The model $f(\mathbf{x};\boldsymbol{\theta})$ comprises two main components: a pre-trained FM that serves as a feature extraction backbone $g(\mathbf{x};\boldsymbol{\theta}_{g})$ and a classification head $h(\mathbf{z};\boldsymbol{\theta}_{h})$, where $\mathbf{z}=g(\mathbf{x};\boldsymbol{\theta}_{g})$ is the feature representation. The overall model is expressed as $f(\mathbf{x};\boldsymbol{\theta})=h(g(\mathbf{x};\boldsymbol{\theta}_{g});\boldsymbol{\theta}_{h})$, with $\boldsymbol{\theta}=\{\boldsymbol{\theta}_{g},\boldsymbol{\theta}_{h}\}$ denoting the model parameters. In FL, the model is trained locally at each client. The resulting local models $f_{i}(\mathbf{x};\boldsymbol{\theta}_{i})$ are aggregated on the server into a global model $f(\mathbf{x};\boldsymbol{\theta}_{\text{glob}})$, computed as a weighted average of the local models: $\boldsymbol{\theta}_{\text{glob}}=\sum_{i=1}^{N}\omega_{i}\boldsymbol{\theta}_{i}$, where $\omega_{i}$ denotes the aggregation weight for client $i$, determined by the aggregation method. The global model is sent back to the clients for the next round. This process is repeated for $R$ rounds.
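The server-side step of this formulation, $\boldsymbol{\theta}_{\text{glob}}=\sum_{i}\omega_{i}\boldsymbol{\theta}_{i}$, can be sketched in a few lines of plain Python. This is a minimal illustration and not the paper's NVFlare implementation: the function name `aggregate`, the dictionary layout, and the parameter name `head.weight` are hypothetical, and each parameter tensor is flattened to a list of floats for simplicity.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # hypothetical layout: name -> flattened values


def aggregate(local_params: List[Params], weights: List[float]) -> Params:
    """Weighted average of client parameters: theta_glob = sum_i w_i * theta_i."""
    if len(local_params) != len(weights):
        raise ValueError("need exactly one weight per client")
    global_params: Params = {}
    for name in local_params[0]:
        size = len(local_params[0][name])
        global_params[name] = [
            sum(w * client[name][k] for w, client in zip(weights, local_params))
            for k in range(size)
        ]
    return global_params


# Two hypothetical clients holding the same (tiny) trainable parameter.
clients = [{"head.weight": [1.0, 2.0]}, {"head.weight": [3.0, 6.0]}]
theta_glob = aggregate(clients, weights=[0.25, 0.75])
print(theta_glob)
```

In a full run this step would be repeated once per round, with the returned parameters broadcast to the clients before the next round of local training.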

2.2 Evaluation Framework

We identify three design choices that can impact performance and efficiency: the classification head architecture, the fine-tuning technique, and the aggregation method. To systematically assess their impact, each element is evaluated independently under consistent conditions. SAM-Med3D’s image encoder [29] is used as the backbone $g(\mathbf{x};\boldsymbol{\theta}_{g})$. This choice is motivated by the 3D nature of the model and its large medical training set. The framework is illustrated in Fig. 1.

Classification head architecture: The output of SAM-Med3D’s encoder is 384 feature maps of shape ($8\times 8\times 8$). To classify samples based on this output, we evaluate three classification head architectures: (1) Linear: an average pooling layer followed by a linear layer; (2) CONV S: a lightweight 4-layer convolutional block with 128, 64, 64, and 32 kernels, followed by a linear layer; and (3) CONV L: a 4-layer convolutional block with 256, 128, 128, and 64 kernels, followed by a linear layer. These architectures are selected based on prior work [5, 30] to explore a trade-off between representational capacity and efficiency.
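To give a feel for the size of the three heads, the following sketch counts their trainable parameters. The text does not state kernel sizes, so the sketch assumes $3\times 3\times 3$ convolutions with bias terms, global pooling before the final linear layer, and two output classes (DE vs. CN); the helper names are hypothetical. Under these assumptions the counts come out at 770, about 1.7M, and about 4.2M, which agrees with the trainable-parameter column of Table 2, but the actual layer configuration should be checked against the released code.

```python
def conv3d_params(c_in: int, c_out: int, kernel: int = 3) -> int:
    """Weights plus biases of one 3D convolution (kernel size is an assumption)."""
    return c_in * c_out * kernel ** 3 + c_out


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def conv_head_params(widths, c_in: int = 384, n_classes: int = 2) -> int:
    """Stack of convolutions followed by pooling and a linear classifier."""
    total = 0
    for c_out in widths:
        total += conv3d_params(c_in, c_out)
        c_in = c_out
    return total + linear_params(c_in, n_classes)


linear_head = linear_params(384, 2)             # pooled 384 features -> 2 classes
conv_s = conv_head_params([128, 64, 64, 32])    # "CONV S"
conv_l = conv_head_params([256, 128, 128, 64])  # "CONV L"
print(linear_head, conv_s, conv_l)
```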

Fine-tuning method: We compare three methods: (1) Full (denoted “All” in the tables): tuning all parameters in the model. (2) CLS Only (linear probing): freezing the backbone $g(\mathbf{x};\boldsymbol{\theta}_{g})$ and training only the classifier $h(\mathbf{z};\boldsymbol{\theta}_{h})$. (3) LoRA: a technique that reduces the number of parameters to be trained. Let $\boldsymbol{\theta}_{g}$ denote the pre-trained parameters of the backbone $g(\mathbf{x};\boldsymbol{\theta}_{g})$. Suppose a linear layer in the encoder has a weight matrix $\mathbf{W}\in\mathbb{R}^{d\times k}$. LoRA introduces a trainable low-rank update $\Delta\mathbf{W}\in\mathbb{R}^{d\times k}$ as:

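$$\mathbf{W}_{\text{LoRA}}=\mathbf{W}+\Delta\mathbf{W}=\mathbf{W}+\mathbf{A}\mathbf{B},\quad \mathbf{A}\in\mathbb{R}^{d\times r},\ \mathbf{B}\in\mathbb{R}^{r\times k},\ r\ll\min(d,k)$$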

where only the weights $\mathbf{A}$ and $\mathbf{B}$ are trainable. The backbone function becomes $g_{\text{LoRA}}(\mathbf{x})=g(\mathbf{x};\boldsymbol{\theta}_{g},\Delta\boldsymbol{\theta}_{g})$, where $\Delta\boldsymbol{\theta}_{g}$ consists of the low-rank parameters $\{\mathbf{A}^{(l)},\mathbf{B}^{(l)}\}_{l\in\mathcal{L}}$ for a subset of layers $\mathcal{L}$ to which LoRA is applied. We test configurations where LoRA is applied to all, the first 6, or the last 6 attention blocks of the encoder. We focus on linear probing and LoRA as they are the most commonly used PEFT methods, known for preserving pre-trained representations while reducing computational cost, making them well-suited for FL [9].
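The low-rank update and its parameter savings can be illustrated with a small pure-Python sketch. The matrix values are toy numbers. The second half is a back-of-the-envelope count that assumes a hidden size of 768, 12 attention blocks, and LoRA applied to two $768\times 768$ projections per block with $r=8$; these architectural details are assumptions rather than statements from the text, although the resulting totals (294,912 and 147,456) are in line with the 294k and 147k LoRA parameters listed in Table 2.

```python
def matmul(a, b):
    """Plain nested-list matrix product (a is d x r, b is r x k)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lora_weight(w, a, b):
    """W_LoRA = W + A B, where only A and B would be trained."""
    delta = matmul(a, b)
    return [[w_ij + d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters of one low-rank pair: A (d x r) plus B (r x k)."""
    return r * (d + k)


# Toy example with d=2, k=3, r=1.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a = [[1.0], [2.0]]
b = [[0.5, 0.0, 1.0]]
w_lora = lora_weight(w, a, b)

# Assumed encoder layout: 12 blocks, two adapted 768 x 768 projections each.
per_block = 2 * lora_params(768, 768, r=8)
lora_all, lora_six = 12 * per_block, 6 * per_block
print(w_lora, lora_all, lora_six)
```

For comparison, a single dense $768\times 768$ matrix has 589,824 entries, so each rank-8 pair trains roughly 2% of the weights it adapts.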

Aggregation technique: We evaluate two traditional aggregation methods: simple averaging, which assigns equal weights to all clients ($\omega_{i}=1/N$), and FedAvg, which weights clients based on their dataset size ($\omega_{i}=|\mathcal{D}_{i}|/\sum_{j=1}^{N}|\mathcal{D}_{j}|$). Furthermore, we explore two advanced methods: (1) FedCE [15]: the aggregation weight is given as $\omega_{i}=\omega_{i}^{\text{grad}}\times\omega_{i}^{\text{data}}$, where $\omega_{i}^{\text{grad}}$ measures the alignment of client $i$’s gradient with those of other clients, reflecting its contribution in gradient space, and $\omega_{i}^{\text{data}}$ is the validation error of the model obtained by aggregating all clients excluding client $i$. A higher error is assumed to reflect greater contribution in the data space. (2) Rate-My-LoRA [12]: validation performance is monitored and higher weights are assigned to clients whose performance declines from the previous round. FedCE and Rate-My-LoRA were selected because they are designed to address client heterogeneity in 3D medical imaging, and Rate-My-LoRA specifically addresses federated fine-tuning of 3D FMs.
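The two traditional weighting rules are simple enough to write down directly. The sketch below uses hypothetical client sizes chosen for readability, not the cohort sizes of this study. FedCE and Rate-My-LoRA are not sketched because they additionally depend on client gradients and per-round validation results.

```python
from typing import List


def simple_avg_weights(n_clients: int) -> List[float]:
    """Simple averaging: every client gets weight 1/N."""
    return [1.0 / n_clients] * n_clients


def fedavg_weights(dataset_sizes: List[int]) -> List[float]:
    """FedAvg: weight proportional to the local dataset size |D_i|."""
    total = sum(dataset_sizes)
    return [size / total for size in dataset_sizes]


sizes = [100, 300, 600]  # hypothetical number of training scans per client
w_simple = simple_avg_weights(len(sizes))
w_fedavg = fedavg_weights(sizes)
print(w_simple, w_fedavg)
```

With these sizes, FedAvg gives the largest client six times the influence of the smallest one, whereas simple averaging treats all three equally.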

2.3 Implementation Details

We used Nvidia-Flare and MONAI [26, 8], with training distributed over 4 H100 GPUs. MRI scans were registered to the MNI template, skull-stripped [10, 27], and resized to $128^{3}$ (SAM-Med3D’s input size). Models were trained for $R=10$ rounds with a batch size of 8, a learning rate of 0.001, and the AdamW optimizer. LoRA rank was set to $r=8$, which outperformed other tested values ($r=4,16$). The code is available at gitlab.com/radiology/neuro/fedmedsam_ad.

2.4 Baselines

We compare against three baselines: (1) a 3D CNN (ResNet18) trained from scratch using FedAvg, following current FL practices for dementia diagnosis [11]; (2) centralized fine-tuning with a frozen encoder and ‘CONV S’ classifier; and (3) a nearest centroid classifier (NCC) using frozen encoder features, where clients share class-wise feature sums and counts. The global centroid for class $c$ is $\mu_{c}=\frac{1}{n_{c}}\sum_{i=1}^{N}\sum_{j:y_{i,j}=c}g(\mathbf{x}_{i,j};\boldsymbol{\theta}_{g})$ , with $n_{c}$ the total number of samples in class $c$ .
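The NCC baseline needs only one exchange of summary statistics, which the following sketch illustrates. It is a minimal pure-Python illustration with toy 2D features standing in for the encoder output; the function names are hypothetical, and the use of squared Euclidean distance at prediction time is an assumption, since the distance metric is not stated in the text.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Stats = Dict[int, Tuple[Vector, int]]  # class -> (feature sum, sample count)


def client_statistics(features: List[Vector], labels: List[int]) -> Stats:
    """Computed locally: class-wise feature sums and counts (all a client shares)."""
    stats: Stats = {}
    for z, y in zip(features, labels):
        total, count = stats.get(y, ([0.0] * len(z), 0))
        stats[y] = ([t + v for t, v in zip(total, z)], count + 1)
    return stats


def global_centroids(all_stats: List[Stats]) -> Dict[int, Vector]:
    """Computed on the server: mu_c = (sum of all class-c features) / n_c."""
    sums: Dict[int, Vector] = {}
    counts: Dict[int, int] = {}
    for stats in all_stats:
        for c, (total, count) in stats.items():
            prev = sums.get(c, [0.0] * len(total))
            sums[c] = [p + t for p, t in zip(prev, total)]
            counts[c] = counts.get(c, 0) + count
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict(z: Vector, centroids: Dict[int, Vector]) -> int:
    """Assign the class of the nearest centroid (squared Euclidean distance)."""
    def sq_dist(u: Vector, v: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(centroids, key=lambda c: sq_dist(z, centroids[c]))


# Two hypothetical clients with toy 2D features.
stats_a = client_statistics([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]], [0, 0, 1])
stats_b = client_statistics([[1.0, 3.0], [12.0, 10.0]], [0, 1])
centroids = global_centroids([stats_a, stats_b])
print(centroids, predict([0.5, 0.5], centroids), predict([9.0, 9.0], centroids))
```

Because only sums and counts leave each client, the resulting centroids are identical to those that would be obtained on the pooled features, and no trainable parameters are involved.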

2.5 Datasets

We compiled a large dataset of 6,076 brain MRI scans from multiple sources reflecting a realistic and heterogeneous setting. The dataset consists of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) [21], a clinical cohort of the Open Access Series of Imaging Studies (OASIS-4) [19], the Neuroimaging in Frontotemporal Dementia Study (NIFD) [25], the National Alzheimer’s Coordinating Center (NACC) cohort [6], the Latin American Brain Health Institute dataset (BrainLAT) [24], as well as the Health-RI Parelsnoer Neurodegenerative Diseases Biobank (PND) [1], which consists of data acquired from 8 medical centers in the Netherlands. These sources cover a wide range of dementia subtypes, geographical locations, and demographic variability, providing a robust benchmark for evaluating dementia diagnosis models in a federated setting. In our experiments, each cohort is treated as a client in the federation. The task is to distinguish dementia patients (DE) from cognitively normal (CN) individuals. Subjects without T1-weighted brain MRI scans were excluded from the analysis. Table 1 shows the label distribution per client. To illustrate inter-client variability in image appearance, we provide intensity histograms (Fig. S1, Appendix).

3 Experimental Results

We assess diagnostic performance using AUC, communication efficiency via average message size and latency, and computational efficiency via GPU memory, energy usage, and FLOPs per sample. AUC per client is reported in Table S.1. We compute 95% confidence intervals (CI) from 10,000 test-set bootstraps; non-overlapping CIs indicate statistical significance.
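The evaluation protocol can be sketched as follows. The text specifies 10,000 test-set bootstraps and the non-overlap criterion; the use of the percentile method for the interval, the skipping of resamples that contain a single class, the fixed seed, and all function names are assumptions of this sketch. The labels and scores are toy values.

```python
import random
from typing import List, Tuple


def auc(labels: List[int], scores: List[float]) -> float:
    """Probability that a positive outscores a negative (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs samples from both classes")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels: List[int], scores: List[float], n_boot: int = 10_000,
                     alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI: resample the test set with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # single-class resample: AUC undefined, draw again
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def cis_overlap(ci_a: Tuple[float, float], ci_b: Tuple[float, float]) -> bool:
    """Non-overlapping intervals are read as a significant difference."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
point = auc(labels, scores)
lower, upper = bootstrap_auc_ci(labels, scores)
print(point, (lower, upper))
```

Applied to the pooled-test-set intervals reported in Table 3, `cis_overlap((0.74, 0.78), (0.84, 0.87))` is false for Linear vs. CONV S (a significant difference), while `cis_overlap((0.84, 0.87), (0.86, 0.89))` is true for FedAvg vs. FedCE.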

3.1 Classification Head Architecture

In this experiment, the SAM-Med3D encoder is frozen and FedAvg is used for aggregation. Classification performance is reported in Fig. 2 (a) and efficiency metrics are reported in Table 2.

Training the classifier significantly outperforms the NCC (AUC = 0.71, 95% CI: 0.69–0.73). Both convolutional heads significantly outperform the linear classifier (AUC = 0.76, 95% CI: 0.74–0.78), with “CONV S” and “CONV L” achieving similar performance (AUC = 0.86, 95% CI: 0.84–0.87 vs. 0.86, 95% CI: 0.85–0.88), both matching ResNet18 (AUC = 0.86, 95% CI: 0.84–0.88) while using <13% of its parameters and 75% of its FLOPs. Larger heads increase message size (0.5 kB for linear vs. 36 kB for “CONV L”) and latency (1.3 ms for linear vs. 3.0 ms for “CONV L”) but minimally impact computational efficiency. “CONV S” offers the best trade-off, retaining high AUC with lower communication cost.

3.2 Fine-tuning Method

This experiment uses “CONV S” as a classifier and FedAvg for aggregation. Diagnostic performance is shown in Fig. 2 (b) and efficiency metrics in Table 2.

None of the fine-tuning methods yielded higher performance than that obtained by linear probing (AUC = 0.86, 95% CI: 0.84–0.87). No significant differences are observed between LoRA configurations: LoRA All (AUC = 0.86, 95% CI: 0.84–0.87), LoRA First 6 (AUC = 0.86, 95% CI: 0.84–0.88), and LoRA Last 6 (AUC = 0.85, 95% CI: 0.83–0.86). Full fine-tuning achieves similar performance (AUC = 0.86, 95% CI: 0.84–0.88) despite the larger number of trained parameters (92M). Fine-tuning the encoder increases computational cost compared to linear probing due to higher parameter counts and gradient computations.

3.3 Federated Aggregation Technique

Fig. 3 shows the performance per client for each aggregation method, alongside the results of centralized training for comparison. The aggregation weights obtained with each method per round are presented in Fig. S.2 in the Appendix.

Across the entire test set, both Rate-My-LoRA and FedCE match the performance of centralized training, with an AUC of 0.87 (95% CI: 0.86–0.89). These methods slightly outperform FedAvg (AUC = 0.86, 95% CI: 0.84–0.87) and significantly outperform simple averaging (AUC = 0.84, 95% CI: 0.82–0.85). At the client level, ADNI and NACC show the highest gain with advanced aggregation methods, particularly when compared to simple averaging. Although the differences are not statistically significant, FedCE outperforms Rate-My-LoRA on PND (AUC = 0.92 vs. 0.88) and NIFD (AUC = 0.89 vs. 0.84). Notably, BrainLAT, which exhibits a slightly different intensity distribution, consistently yields lower AUCs across methods and does not benefit from advanced aggregation strategies.

4 Discussion

In this work, we implemented a framework for a systematic evaluation of federated FM fine-tuning for dementia classification using T1-weighted MRI. We investigated three key design choices and their impact on model performance and efficiency, leveraging a large dataset consisting of 6 different cohorts.

Our results show that FL enables effective fine-tuning of the SAM-Med3D segmentation model for a classification task, achieving comparable performance to conventional CNNs with higher efficiency. In addition, it approaches the performance of centralized fine-tuning, underscoring the promise of integrating FL with FMs for AI-based dementia diagnosis. Our findings indicate that the classification head architecture has a substantial impact on performance. Incorporating convolutional layers on top of SAM-Med3D enhances performance by adapting its segmentation features to the classification task. We find that fine-tuning the encoder provides no significant performance improvement over using it as a frozen backbone, highlighting the high quality of the pre-trained features. This is especially valuable in FL, as freezing the backbone greatly reduces communication overhead without compromising performance. We observe that advanced aggregation strategies improve overall diagnostic performance compared to conventional methods. However, a more detailed analysis reveals that the gain is primarily driven by clients with large datasets (e.g., NACC), while smaller clients see limited benefit. This suggests that relying solely on validation-based client weighting may be insufficient. Supporting this, we find that FedCE, which incorporates gradient information, outperforms Rate-My-LoRA, which depends exclusively on validation metrics, on smaller clients. While real-world federations often involve heterogeneous hardware, we employ a homogeneous setup to minimize variability stemming from infrastructure differences. This controlled approach enables a more precise evaluation of how design choices influence efficiency.

While our study introduces a flexible framework for MRI-based dementia diagnosis with a broader applicability to other medical imaging tasks, it has a number of limitations. First, the evaluation is constrained by the scarcity of open-source 3D FMs for MRI data, limiting our experiments to SAM-Med3D and a selected set of design choices. As the field progresses and more models become available, future work will benchmark alternative FMs to determine their suitability for FL environments. Additionally, while we focus on linear probing and LoRA as the current de facto approaches in PEFT, emerging FL-specific methods could further optimize performance and communication efficiency. Exploring these techniques will be critical as FMs gain traction in FL. Finally, a deeper theoretical analysis of aggregation methods is essential. While the evaluated methods rely mostly on validation performance, integrating fairness-aware aggregation, convergence guarantees, and client-specific bias mitigation could improve both the performance and equity of the resulting models.

In essence, this work investigates federated fine-tuning of FMs within real-world, multi-source datasets, moving beyond simulated federated data to ensure clinical relevance and practical viability. By investigating foundational yet underexplored components of the federated fine-tuning paradigm, it lays the ground for broader and more in-depth future evaluations.


Acknowledgements

This project is supported by a 2022 Erasmus MC Fellowship. Esther E. Bron is recipient of TAP-dementia, a ZonMw funded project (#10510032120003). Esther E. Bron and Stefan Klein are recipients of EUCAIM, Cancer Image Europe, co-funded by the European Union under Grant Agreement 101100633. Data used in this study was partially obtained from the National Alzheimer’s Coordinating Center (NACC) database. MRI imaging data are part of the SCAN initiative. The NACC database is funded by NIA/NIH Grant U24 AG072122. SCAN was funded as a U24 grant (AG067418).

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

Appendix

Bibliography

[1] Aalten, P., Ramakers, I.H., Biessels, G.J., De Deyn, P.P., Koek, H.L., Olde Rikkert, M.G., Oleksik, A.M., Richard, E., Smits, L.L., van Swieten, J.C., et al.: The Dutch Parelsnoer Institute-Neurodegenerative diseases; Methods, Design and Baseline Results. BMC Neurology 14, 1–8 (2014)
[2] Alkhunaizi, N., Almalik, F., Al-Refai, R., Naseer, M., Nandakumar, K.: Probing the Efficacy of Federated Parameter-Efficient Fine-Tuning of Vision Transformers for Medical Image Classification. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 236–245 (2024)
[3] Alzheimer’s Association: Why Get Checked? https://www.alz.org/alzheimers-dementia/diagnosis/why-get-checked, accessed: 2025-02-19
[4] Babar, M., Qureshi, B., Koubaa, A.: Investigating the Impact of Data Heterogeneity on the Performance of Federated Learning Algorithms Using Medical Imaging. PLoS ONE 19(5), e0302539 (2024)
[5] Baharoon, M., Qureshi, W., Ouyang, J., Xu, Y., Aljouie, A., Peng, W.: Towards General Purpose Vision Foundation Models for Medical Image Analysis: An Experimental Study of DINOv2 on Radiology Benchmarks (2023)
[6] Beekly, D.L., Ramos, E.M., van Belle, G., Deitrich, W., Clark, A.D., Jacka, M.E., Kukull, W.A.: The National Alzheimer’s Coordinating Center (NACC) Database: an Alzheimer disease database. Alzheimer Disease & Associated Disorders 18(4), 270–277 (2004)
[7] Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the Opportunities and Risks of Foundation Models. arXiv:2108.07258 (2021)
[8] Cardoso, M.J., Li, W., Brown, R., Ma, N., Kerfoot, E., Wang, Y., Murrey, B., Myronenko, A., Zhao, C., Yang, D., et al.: MONAI: An Open-source Framework for Deep Learning in Healthcare. arXiv:2211.02701 (2022)