FedPS: Federated data Preprocessing via aggregated Statistics

Xuefeng Xu; Graham Cormode

arXiv:2602.10870·cs.LG·February 12, 2026

Open Access · 4 Reviews

TL;DR

FedPS is a federated data preprocessing framework that uses aggregated statistics and data-sketching techniques to enable privacy-preserving, communication-efficient, and consistent data preprocessing for federated learning systems.

Contribution

This paper introduces FedPS, a novel federated preprocessing framework that leverages data sketches for efficient, privacy-preserving data summarization and extends preprocessing models to federated settings.

Findings

1. FedPS achieves efficient data summarization with privacy preservation.
2. It supports multiple preprocessing tasks such as scaling, encoding, and imputation.
3. FedPS extends models such as k-Means and KNN to federated learning environments.

Abstract

Federated Learning (FL) enables multiple parties to collaboratively train machine learning models without sharing raw data. However, before training, data must be preprocessed to address missing values, inconsistent formats, and heterogeneous feature scales. This preprocessing stage is critical for model performance but is largely overlooked in FL research. In practical FL systems, privacy constraints prohibit centralizing raw data, while communication efficiency introduces further challenges for distributed preprocessing. We introduce FedPS, a unified framework for federated data preprocessing based on aggregated statistics. FedPS leverages data-sketching techniques to efficiently summarize local datasets while preserving essential statistical information. Building on these summaries, we design federated algorithms for feature scaling, encoding, discretization, and missing-value…
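As a rough illustration of preprocessing from aggregated statistics, the sketch below standardizes one feature across clients using only per-client counts, sums, and sums of squares. It is not taken from the FedPS code: the function names `local_summary` and `aggregate` and the client data are hypothetical, and a real system would likely use a numerically safer variance update and the sketching methods the abstract mentions.

```python
import math

def local_summary(values):
    """Client side (hypothetical helper): count, sum, and sum of squares.
    Only these three numbers leave the client, not the raw values."""
    n = len(values)
    s = sum(values)
    ss = sum(v * v for v in values)
    return n, s, ss

def aggregate(summaries):
    """Server side (hypothetical helper): add the summaries element-wise,
    then derive the global mean and population standard deviation."""
    n = sum(t[0] for t in summaries)
    s = sum(t[1] for t in summaries)
    ss = sum(t[2] for t in summaries)
    mean = s / n
    # E[x^2] - E[x]^2 can go slightly negative from rounding, so clip at 0.
    var = max(ss / n - mean * mean, 0.0)
    return mean, math.sqrt(var)

# Made-up data for three clients holding the same feature.
clients = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

mean, std = aggregate([local_summary(c) for c in clients])

# Each client applies the shared parameters locally.
scaled = [[(v - mean) / std for v in c] for c in clients]
```

Because the three statistics are additive, the result matches what a centralized scaler would compute on the pooled data, while each client sends a constant-size message.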

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · Confidence 3

Strengths

- The paper provides a solution for a critical yet often overlooked aspect of Federated Learning.
- The code attached to the paper is overall well-written and of good quality, and I agree with the authors' claim that FedPS lays a strong groundwork for federated data pre-processing.
- The paper is overall well-written, and the quality of presentation is good.

Weaknesses

The main weakness of the paper is the lack of novelty. The paper is a description of a library for federated data pre-processing. Besides the library itself, the only part of the paper that could be considered as a significant contribution is the proposed algorithm for Federated Power-Transforms, which aims at addressing numerical issues in power transform through logarithmic transformation. While I overall like this paper and think that it is helpful for the FL community, I don't think that I

Reviewer 02 · Rating 3 · Confidence 3

Strengths

1. The paper introduces a robust methodology for federated data preprocessing through the FedPS tool, which leverages aggregated statistics and data sketching techniques.
2. The authors tackle numerical challenges associated with power transforms, which have been a limitation in previous research. By employing log-space computations and constrained optimization, the proposed Federated Power Transform algorithm enhances numerical stability and achieves superlinear convergence rates.
3. The paper

Weaknesses

1. The scope of the experiments is limited in the variety of datasets and models tested. Only very simple models are tested; models such as the ResNet family should also be evaluated with the proposed method.
2. While the paper discusses the necessity of federated data preprocessing in non-IID settings, it does not include detailed experimental results that specifically demonstrate the performance of the proposed methods under non-IID conditions.
3. A detailed privacy analysi

Reviewer 03 · Rating 3 · Confidence 4

Strengths

- Data preprocessing is often useful in practice, so considering such preprocessing operations in a federated learning scenario has some practical value.

Weaknesses

- It is not clear what the main contribution of this paper is. It seems to be a straightforward combination of several existing techniques. This is also suggested by the list of main contributions on page 2, which does not include any fundamental technical problem that this paper solves.
- The main paper does not discuss any unique characteristic of federated learning problems, where the privacy of data at clients, often including their statistics, needs to be preserved. There is some discussio

Reviewer 04 · Rating 3 · Confidence 4

Strengths

The problem of data preprocessing/analytics in federated networks is important. FedPS implements a set of data preprocessing tools, using techniques like random sketching. It considers some numerical issues in one algorithm, and makes the implementation open source as well.

Weaknesses

The algorithmic contributions of this paper are limited. It centers around implementations of previous algorithms. It uses the known Log-Sum-Exp trick to handle numerical instabilities of the power transform, as well as clipping the data when their absolute values are too large. For many analytics tasks in federated learning (estimating quantiles, estimating heavy hitters, etc.), simply adding or merging local statistics may not be optimal. For instance, averaging local medians wouldn’t give us globa
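The reviewer's point about medians can be checked with a small numeric example. The two client datasets below are made up for illustration; the sketch only shows that averaging local medians (plain or size-weighted) generally differs from the median of the pooled data, and says nothing about how FedPS itself estimates quantiles.

```python
import statistics

# Hypothetical data held by two clients.
clients = [[1, 2, 3], [10, 20, 30, 40, 50, 60, 70]]

# Each client reports only its local median.
local_medians = [statistics.median(c) for c in clients]

# Naive aggregation: plain average of the local medians.
avg_of_medians = sum(local_medians) / len(local_medians)

# Size-weighted average of the local medians.
total = sum(len(c) for c in clients)
weighted = sum(len(c) * m for c, m in zip(clients, local_medians)) / total

# Ground truth: median of the pooled data (not available in a real deployment).
pooled = [v for c in clients for v in c]
global_median = statistics.median(pooled)

print(local_medians, avg_of_medians, weighted, global_median)
```

Here the local medians are 2 and 40, their plain average is 21, the size-weighted average is 28.6, and the pooled median is 25, so neither form of averaging recovers the global value.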

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Ethics and Social Impacts of AI · Stochastic Gradient Optimization Techniques