A robust sampling technique for realistic distribution simulation in federated learning

Robin Hoepp; Leonhard Rist; Alexander Katzmann; Raghavan Ashok; Andreas Wimmer; Michael Sühling; Andreas Maier

PMC · DOI:10.1007/s11548-025-03504-z·September 2, 2025

A robust sampling technique for realistic distribution simulation in federated learning

Robin Hoepp, Leonhard Rist, Alexander Katzmann, Raghavan Ashok, Andreas Wimmer, Michael Sühling, Andreas Maier

PDF

Open Access

TL;DR

This paper introduces a sampling method to simulate realistic data distributions in federated learning, showing how client biases can degrade model performance.

Contribution

A novel sampling algorithm using chi-squared and Gini impurity to simulate non-IID label distributions in federated learning.

Findings

01

The proposed sampling technique realistically simulates client-biased label distributions.

02

A global model's performance dropped by 25.3% on weight and 28.7% on height estimation due to biased distributions.

03

The method is architecture-agnostic and applicable to various non-IID problems.

Abstract

Federated Learning helps training deep learning networks with diverse data from different locations, particularly in restricted clinical settings. However, label distributions overlapping only partially across clients, due to different demographics, may significantly harm the global training, and thus local model performance. Investigating such effects before rolling out large-scale Federated Learning setups requires proper sampling of the expected label distributions. We present a sampling algorithm to build data subsets according to desired mean and standard deviations from an initial global distribution. To this end, we incorporate the chi-squared and Gini impurity measures to numerically optimize label distributions for multiple groups in an efficient fashion. Using a real-world application scenario, we sample train and test groups according to region-specific distributions for 3D…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Diseases1

IID

Figures12

Click any figure to enlarge with its caption.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsPrivacy-Preserving Technologies in Data · Domain Adaptation and Few-Shot Learning · Colorectal Cancer Screening and Detection