FedPS: Federated data Preprocessing via aggregated Statistics
Xuefeng Xu, Graham Cormode

TL;DR
FedPS is a federated data preprocessing framework that uses aggregated statistics and data-sketching techniques to enable privacy-preserving, communication-efficient, and consistent data preprocessing for federated learning systems.
Contribution
This paper introduces FedPS, a novel federated preprocessing framework that leverages data sketches for efficient, privacy-preserving data summarization and extends preprocessing models to federated settings.
Findings
FedPS achieves efficient data summarization with privacy preservation.
It supports multiple preprocessing tasks like scaling, encoding, and imputation.
FedPS extends models such as k-Means and KNN to federated learning environments.
Abstract
Federated Learning (FL) enables multiple parties to collaboratively train machine learning models without sharing raw data. However, before training, data must be preprocessed to address missing values, inconsistent formats, and heterogeneous feature scales. This preprocessing stage is critical for model performance but is largely overlooked in FL research. In practical FL systems, privacy constraints prohibit centralizing raw data, while communication efficiency introduces further challenges for distributed preprocessing. We introduce FedPS, a unified framework for federated data preprocessing based on aggregated statistics. FedPS leverages data-sketching techniques to efficiently summarize local datasets while preserving essential statistical information. Building on these summaries, we design federated algorithms for feature scaling, encoding, discretization, and missing-value…
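To make the idea of preprocessing from aggregated statistics concrete, here is a minimal sketch of federated standard scaling. It is an illustration under stated assumptions, not the FedPS API: the function names (`local_summary`, `aggregate`, `scale`) and the toy client data are hypothetical, and the sum-of-squares variance formula is the simplest choice rather than the most numerically robust one.

```python
import math

def local_summary(values):
    """Client side (hypothetical helper): count, sum, and sum of squares of one feature."""
    n = len(values)
    s = sum(values)
    ss = sum(v * v for v in values)
    return n, s, ss

def aggregate(summaries):
    """Server side: add the summaries element-wise, then derive global mean and std."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    # Population variance; clamp at 0 to guard against small negative rounding error.
    var = max(total_sq / n - mean * mean, 0.0)
    return mean, math.sqrt(var)

def scale(values, mean, std):
    """Client side: apply the shared parameters locally."""
    return [(v - mean) / std if std > 0 else 0.0 for v in values]

# Toy data: three clients holding different amounts of the same feature.
clients = [[1.0, 2.0, 3.0], [10.0, 20.0], [4.0]]
mean, std = aggregate([local_summary(c) for c in clients])
scaled = [scale(c, mean, std) for c in clients]
```

Only three numbers per feature leave each client, and every client ends up applying the same mean and standard deviation, which is what makes the preprocessing consistent across parties.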
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- The paper provides a solution for a critical yet often overlooked aspect of Federated Learning.
- The code attached to the paper is overall well-written and of good quality, and I agree with the authors' claim that FedPS lays a strong groundwork for federated data pre-processing.
- The paper is overall well-written, and the quality of presentation is good.
The main weakness of the paper is the lack of novelty. The paper is a description of a library for federated data pre-processing. Besides the library itself, the only part of the paper that could be considered as a significant contribution is the proposed algorithm for Federated Power-Transforms, which aims at addressing numerical issues in power transform through logarithmic transformation. While I overall like this paper and think that it is helpful for the FL community, I don't think that I
1. The paper introduces a robust methodology for federated data preprocessing through the FedPS tool, which leverages aggregated statistics and data sketching techniques.
2. The authors tackle numerical challenges associated with power transforms, which have been a limitation in previous research. By employing log-space computations and constrained optimization, the proposed Federated Power Transform algorithm enhances numerical stability and achieves superlinear convergence rates.
3. The paper
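The log-space computation the reviewers refer to can be illustrated with the standard Log-Sum-Exp trick. The sketch below is a generic illustration, not the paper's Federated Power Transform algorithm: `logsumexp`, `log_mean_power`, and the sample values are assumptions chosen to show why evaluating a power directly can overflow while the log-space form stays finite.

```python
import math

def logsumexp(log_terms):
    """Return log(sum(exp(t))) without exponentiating large values directly."""
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

def log_mean_power(xs, lam):
    """log(mean(x ** lam)) for positive xs, computed entirely in log space."""
    log_terms = [lam * math.log(x) for x in xs]
    return logsumexp(log_terms) - math.log(len(xs))

# Hypothetical large-magnitude feature values and exponent.
xs = [1e150, 2e150, 3e150]
lam = 3.0

stable = log_mean_power(xs, lam)  # finite, roughly 450 * ln(10) + ln(12)

# The direct computation overflows double precision.
try:
    naive = sum(x ** lam for x in xs) / len(xs)
    overflowed = False
except OverflowError:
    overflowed = True
```

Subtracting the maximum before exponentiating keeps every `exp` argument at or below zero, so the intermediate sum never exceeds the number of terms.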
1. The scope of the experiments is limited in terms of the variety of datasets and models tested. Only very simple models are tested; models such as the ResNet family should also be evaluated with the proposed method.
2. While the paper discusses the necessity of federated data preprocessing in non-IID settings, it does not include detailed experimental results that specifically demonstrate the performance of the proposed methods under non-IID conditions.
3. A detailed privacy analysi
- Data preprocessing is often useful in practice. The consideration of such preprocessing operations in a federated learning scenario can have some practical usefulness.
- It is not clear what the main contribution of this paper is. It seems to be a straightforward combination of several existing techniques. This is also suggested by the list of main contributions on page 2, which does not include any fundamental technical problem that this paper solves.
- The main paper does not discuss any unique characteristic of federated learning problems, where the privacy of data at clients, often including their statistics, needs to be preserved. There is some discussio
The problem of data preprocessing/analytics in federated networks is important. FedPS implements a set of data preprocessing tools, using techniques like random sketching. It considers some numerical issues in one algorithm, and open-sources the implementation as well.
Algorithmic contributions of this paper are limited. It centers around implementations of previous algorithms. It uses a known Log-Sum-Exp trick to handle numerical instabilities of power transform, as well as clipping the data when their absolute values are too big. For lots of the analytics tasks (estimating quantiles, estimating heavy hitters, etc) in federated learning, simply adding or merging local statistics may not be optimal. For instance, averaging local medians wouldn’t give us globa
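The reviewer's point about medians can be checked with a small example. The client data below is hypothetical and deliberately unbalanced; the sketch only shows that averaging per-client medians and taking the median of the pooled data are different quantities.

```python
import statistics

# Hypothetical, unevenly sized and skewed client datasets.
clients = [[1, 2, 3], [4, 5, 6, 7, 8, 100, 200]]

# Naive aggregation: each client reports its median, the server averages them.
local_medians = [statistics.median(c) for c in clients]  # [2, 7]
avg_of_medians = statistics.mean(local_medians)          # 4.5

# Reference value: the median of all data pooled together
# (not available in a real federated setting).
pooled = sorted(v for c in clients for v in c)
global_median = statistics.median(pooled)                # 5.5
```

The gap grows with skew and with imbalance in client sizes, which is why order statistics are usually estimated from mergeable summaries rather than by combining local point estimates.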
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Ethics and Social Impacts of AI · Stochastic Gradient Optimization Techniques
