MedFL-Stress: A Systematic Robustness Evaluation of Federated Brain Tumor Segmentation under Cross-Hospital MRI Appearance Shift
Kiran Naseer, Naveed Anwer Butt

TL;DR
This paper introduces MedFL-Stress, a framework for stress-testing federated brain tumor segmentation models under MRI appearance shifts, revealing robustness issues masked by average performance metrics.
Contribution
It presents a systematic evaluation protocol that uncovers failure modes in federated learning models, emphasizing the importance of robustness metrics over mean accuracy.
Findings
FedBN reduces inter-hospital performance disparity by 41%.
Weakest hospital performance improves by 3.5 Dice points.
Average global Dice stays high even when individual hospitals degrade, so the mean hides these robustness problems.
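To make the findings above concrete, here is a minimal sketch of how a mean score can sit next to a weak worst-hospital score. The summary page does not give the paper's formulas, so everything here is an assumption for illustration: the function names `dice` and `robustness_summary`, the choice of max-minus-min as the disparity measure, and the per-hospital numbers, which are made up and are not results from the paper.

```python
from statistics import mean


def dice(pred, target):
    """Dice overlap between two flat binary masks (sequences of 0/1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


def robustness_summary(per_hospital_dice):
    """Summarize per-hospital Dice scores.

    ASSUMPTION: disparity is taken as max minus min across hospitals;
    the paper may define it differently (for example, as a standard deviation).
    """
    scores = list(per_hospital_dice.values())
    worst_site = min(per_hospital_dice, key=per_hospital_dice.get)
    return {
        "mean_dice": mean(scores),
        "worst_dice": per_hospital_dice[worst_site],
        "worst_site": worst_site,
        "disparity": max(scores) - min(scores),
    }


# Hypothetical scores for four simulated hospitals (not from the paper).
example = {"A": 0.90, "B": 0.88, "C": 0.86, "D": 0.60}
summary = robustness_summary(example)
```

With these made-up scores the mean is about 0.81, while hospital D sits at 0.60, a gap that a mean-only report would not show.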
Abstract
Federated learning enables hospitals to collaboratively train segmentation models without sharing patient data. However, current evaluation protocols report only average performance across clients, masking failures at individual sites. In clinical deployment, a model that fails consistently at one hospital is a real safety risk that a good mean score can hide entirely. We introduce MedFL-Stress, a controlled stress-testing framework that exposes exactly this failure mode. Using 2D axial slices from BraTS 2020 distributed across four simulated hospital clients, we apply graded MRI appearance shifts (gamma contrast, scale-shift, and noise-plus-blur) reflecting scanner and acquisition variability in real multi-site deployments. Three federated baselines are evaluated: FedAvg, FedProx, and FedBN. Worst-hospital Dice and inter-hospital disparity are treated as primary metrics, not…
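The three appearance shifts named in the abstract can be sketched roughly as follows. This is an illustrative guess at their form, not the authors' implementation: the function names, the parameter values in `SEVERITY_LEVELS`, and the use of a simple box blur in place of whatever blur kernel the paper uses are all assumptions. Slices are assumed to be 2D float arrays already normalized to [0, 1].

```python
import numpy as np


def gamma_contrast(img, gamma):
    """Bend the intensity curve; gamma = 1 leaves the slice unchanged."""
    return np.clip(img, 0.0, 1.0) ** gamma


def scale_shift(img, scale, shift):
    """Linear intensity change, clipped back to [0, 1]."""
    return np.clip(scale * img + shift, 0.0, 1.0)


def noise_plus_blur(img, noise_std, blur_radius, rng):
    """Add Gaussian noise, then apply a box blur (a stand-in for the paper's blur)."""
    noisy = img + rng.normal(0.0, noise_std, size=img.shape)
    k = 2 * blur_radius + 1
    padded = np.pad(noisy, blur_radius, mode="edge")
    height, width = img.shape
    blurred = np.zeros_like(noisy)
    # Average the k x k neighbourhood by summing shifted views of the padded slice.
    for dy in range(k):
        for dx in range(k):
            blurred += padded[dy:dy + height, dx:dx + width]
    return np.clip(blurred / (k * k), 0.0, 1.0)


# Hypothetical severity grades; the paper's actual values are not given here.
SEVERITY_LEVELS = {
    1: {"gamma": 1.2, "scale": 1.05, "shift": 0.02, "noise_std": 0.02, "blur_radius": 0},
    2: {"gamma": 1.5, "scale": 1.10, "shift": 0.05, "noise_std": 0.05, "blur_radius": 1},
    3: {"gamma": 2.0, "scale": 1.20, "shift": 0.10, "noise_std": 0.10, "blur_radius": 2},
}

rng = np.random.default_rng(0)
slice_2d = np.linspace(0.0, 1.0, 64).reshape(8, 8)  # toy stand-in for an MRI slice
level = SEVERITY_LEVELS[2]
shifted = noise_plus_blur(slice_2d, level["noise_std"], level["blur_radius"], rng)
```

In a stress test of this kind, each shift would be applied at increasing severity to one client's evaluation data, and the per-hospital Dice recorded at each grade.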
