Self-Supervised Federated Learning under Data Heterogeneity for Label-Scarce Diatom Classification

Mingkun Tan; Xilu Wang; Michael Kloster; Tim W. Nattkemper

arXiv:2603.29633 · cs.CV · April 1, 2026

TL;DR

This paper investigates self-supervised federated learning for diatom classification under two kinds of data heterogeneity: variation in unlabeled data volume and label-space misalignment. It proposes new partitioning schemes and adaptive methods to improve performance.

Contribution

The paper introduces PreDi for controllable simulation of data heterogeneity and PreP-WFL for adaptive enhancement of class representations, clarifying how heterogeneity affects federated learning.

Findings

01. Heterogeneity in unlabeled data volume improves pre-training.

02. Prevalence dominates performance under label-space heterogeneity.

03. PreP-WFL mitigates performance degradation in low-prevalence scenarios.

Abstract

Label-scarce visual classification under decentralized and heterogeneous data is a fundamental challenge in pattern recognition, especially when sites exhibit partially overlapping class sets. While self-supervised federated learning (SSFL) offers a promising solution, existing studies commonly assume the same data heterogeneity pattern throughout pre-training and fine-tuning. Moreover, current partitioning schemes often fail to generate pure partially class-disjoint data settings, limiting controllable simulation of real-world label-space heterogeneity. In this work, we introduce SSFL for diatom classification as a representative real-world instance and systematically investigate stage-specific data heterogeneity. We study cross-site variation in unlabeled data volume during pre-training and label-space misalignment during downstream fine-tuning. To study the latter in a controllable…
