FLAM: Evaluating Model Performance with Aggregatable Measures in Federated Learning

Fabian Stricker; Jose A. Peregrina; David Bermbach; Christian Zirpins

arXiv:2605.07962·cs.LG·May 11, 2026

TL;DR

This paper introduces FLAM, a performance evaluation method for federated learning that produces the same metric values as centralized evaluation without requiring a global test dataset.

Contribution

FLAM provides a novel approach to aggregating evaluation metrics in federated learning. It addresses inconsistencies between participant-based and centralized evaluation and generalizes beyond accuracy.

Findings

01  FLAM achieves identical results to centralized evaluation.

02  It works without a global test dataset.

03  It generalizes to various evaluation metrics.

Abstract

Performance evaluation is essential for assessing the quality of machine learning (ML) models and guiding deployment decisions. In federated learning (FL), assessing performance is challenging because data are distributed across participants. Consequently, the coordinator must rely on locally computed evaluation metrics and aggregate them to assess the global model. A key challenge is that common aggregation strategies, such as weighted averaging based on the number of local samples per participant, do not always produce the same results as centralized evaluation. Existing definitions of performance evaluation are largely tailored to accuracy and do not generalize to other metrics, leading to inconsistencies between participant-based and centralized evaluation. Such discrepancies are inconsistent with the FL objective and yield incorrect metric values. To address this…
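The discrepancy described in the abstract can be illustrated with a small, self-contained sketch. It is not the FLAM method (the abstract excerpt is truncated before the method is described). It only shows, for a binary classifier and two hypothetical participants, that sample-weighted averaging of local metrics matches centralized evaluation for accuracy but not for precision. The `Counts` class, the helper functions, and all numbers are assumptions made up for illustration. Pooling the confusion counts stands in for centralized evaluation on the union of the local test sets.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Counts:
    """Binary confusion counts from one participant (hypothetical structure)."""
    tp: int
    fp: int
    fn: int
    tn: int

    @property
    def n(self) -> int:
        # Number of local test samples.
        return self.tp + self.fp + self.fn + self.tn


def accuracy(c: Counts) -> float:
    return (c.tp + c.tn) / c.n


def precision(c: Counts) -> float:
    predicted_positive = c.tp + c.fp
    # Convention assumed here: precision is 0.0 when nothing is predicted positive.
    return c.tp / predicted_positive if predicted_positive else 0.0


def weighted_average(metric: Callable[[Counts], float],
                     clients: Sequence[Counts]) -> float:
    """Average the locally computed metric, weighted by local sample count."""
    total = sum(c.n for c in clients)
    return sum(c.n * metric(c) for c in clients) / total


def pooled(metric: Callable[[Counts], float],
           clients: Sequence[Counts]) -> float:
    """Sum the counts first, then compute the metric once (centralized reference)."""
    merged = Counts(
        tp=sum(c.tp for c in clients),
        fp=sum(c.fp for c in clients),
        fn=sum(c.fn for c in clients),
        tn=sum(c.tn for c in clients),
    )
    return metric(merged)


# Two made-up participants with 100 test samples each.
clients = [
    Counts(tp=90, fp=10, fn=0, tn=0),  # local precision 0.9
    Counts(tp=1, fp=9, fn=0, tn=90),   # local precision 0.1
]

weighted_acc = weighted_average(accuracy, clients)
pooled_acc = pooled(accuracy, clients)
weighted_prec = weighted_average(precision, clients)
pooled_prec = pooled(precision, clients)
```

With these numbers both accuracy values are 0.905, while the weighted-average precision is 0.5 and the pooled precision is 91/110, roughly 0.83. Accuracy survives weighted averaging because its denominator is the local sample count used as the weight. The denominator of precision (predicted positives) differs per participant, so the same weights no longer reproduce the centralized value.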
