FLAM: Evaluating Model Performance with Aggregatable Measures in Federated Learning
Fabian Stricker, Jose A. Peregrina, David Bermbach, Christian Zirpins

TL;DR
This paper introduces FLAM, a performance evaluation method for federated learning that yields metrics consistent with centralized evaluation without requiring a global test dataset.
Contribution
FLAM provides a novel aggregation approach for evaluation metrics in federated learning, addressing inconsistencies and generalizing beyond accuracy.
Findings
FLAM produces results identical to those of centralized evaluation.
It works without a global test dataset.
It generalizes to various evaluation metrics.
Abstract
Performance evaluation is essential for assessing the quality of machine learning (ML) models and guiding deployment decisions. In federated learning (FL), assessing performance is challenging because data are distributed across participants. Consequently, the coordinator must rely on locally computed evaluation metrics and aggregate them to assess the global model. A key challenge is that common aggregation strategies, such as averaging weighted by each participant's number of local samples, do not always produce the same results as centralized evaluation. Existing definitions of performance evaluation are largely tailored to accuracy and do not generalize to other metrics, leading to inconsistencies between participant-based and centralized evaluation. Such discrepancies are inconsistent with the FL objective and yield incorrectly calculated metric values. To address this…
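The discrepancy described in the abstract can be illustrated with a small sketch. This is not the FLAM method from the paper; it only shows, for precision, how a sample-weighted average of locally computed metrics can differ from the value computed on the pooled data, and how summing raw counts across participants recovers the pooled value. The `Counts` class, the helper functions, and the two participants' numbers are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical per-participant confusion counts for the positive class."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives


def precision(c: Counts) -> float:
    """Precision = TP / (TP + FP); defined as 0.0 when nothing was predicted positive."""
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def weighted_average_precision(parts: list[Counts], sizes: list[int]) -> float:
    """Average the local precision values, weighted by local sample count."""
    total = sum(sizes)
    return sum(precision(c) * n for c, n in zip(parts, sizes)) / total


def pooled_precision(parts: list[Counts]) -> float:
    """Sum the raw counts first, then compute precision once on the totals.

    This equals the precision a central evaluator would obtain on the
    union of all local test sets, because the counts are additive.
    """
    merged = Counts(
        tp=sum(c.tp for c in parts),
        fp=sum(c.fp for c in parts),
        fn=sum(c.fn for c in parts),
    )
    return precision(merged)


# Two illustrative participants with 100 local test samples each (made-up numbers).
participants = [Counts(tp=9, fp=1, fn=5), Counts(tp=10, fp=40, fn=2)]
sample_sizes = [100, 100]

weighted = weighted_average_precision(participants, sample_sizes)
pooled = pooled_precision(participants)

print(f"sample-weighted average of local precision: {weighted:.4f}")
print(f"precision from summed counts:               {pooled:.4f}")
```

In this toy setting the two numbers differ because precision depends on how many positive predictions each participant made, not on how many samples it holds. Sharing additive counts rather than finished metric values is one way to avoid the mismatch; whether FLAM works this way is not stated in the truncated abstract above.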
