FedRW: Efficient Privacy-Preserving Data Reweighting for Enhancing Federated Learning of Language Models

Pukang Ye; Junwei Luo; Xiaolei Dong; Yunbo Yang

arXiv:2511.07505 · cs.CR · November 12, 2025

TL;DR

FedRW introduces a privacy-preserving, reweighting-based approach to handle data duplication in federated language model training, improving efficiency, generalization, and privacy without relying on trusted third parties.

Contribution

To the authors' knowledge, FedRW is the first framework to perform soft deduplication via sample reweighting, rather than deletion, in federated LLM training without a trusted third party.

Findings

01. Achieves up to 28.78x speedup in preprocessing
02. Improves perplexity by approximately 11.42%
03. Provides enhanced security guarantees

Abstract

Data duplication within large-scale corpora often impedes large language models' (LLMs) performance and privacy. In privacy-concerned federated learning scenarios, conventional deduplication methods typically rely on trusted third parties to perform uniform deletion, risking loss of informative samples while introducing privacy vulnerabilities. To address these gaps, we propose Federated ReWeighting (FedRW), the first privacy-preserving framework, to the best of our knowledge, that performs soft deduplication via sample reweighting instead of deletion in federated LLM training, without assuming a trusted third party. At its core, FedRW proposes a secure, frequency-aware reweighting protocol through secure multi-party computation, coupled with a parallel orchestration strategy to ensure efficiency and scalability. During training, FedRW utilizes an adaptive reweighting mechanism with…
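The idea of soft deduplication via sample reweighting can be illustrated with a minimal plaintext sketch. It is not the FedRW protocol: it skips the secure multi-party computation entirely and assumes the global duplicate counts are already known. The weighting rule (weight = 1 / count^alpha), the `alpha` parameter, and the function names `frequency_weights` and `weighted_mean_loss` are all hypothetical choices made for illustration; the abstract does not specify the actual rule.

```python
import math
from collections import Counter


def frequency_weights(samples, alpha=1.0):
    """Assign each sample a weight that shrinks with its duplicate count.

    Assumed rule (not taken from the paper): weight = 1 / count**alpha.
    Counting is done in plaintext here; FedRW derives frequency
    information through secure multi-party computation instead.
    """
    counts = Counter(samples)
    return [1.0 / counts[s] ** alpha for s in samples]


def weighted_mean_loss(losses, weights):
    """Average per-sample losses, normalised by the total weight."""
    total = sum(weights)
    return sum(w * l for w, l in zip(weights, losses)) / total


# Toy corpus pooled across clients: "a" appears three times.
samples = ["a", "b", "a", "c", "a"]
weights = frequency_weights(samples)

# Hypothetical per-sample losses; duplicates share the same loss.
losses = [0.5, 1.0, 0.5, 3.0, 0.5]

loss = weighted_mean_loss(losses, weights)
unweighted = sum(losses) / len(losses)

# Perplexity is the exponential of the mean loss.
perplexity = math.exp(loss)
```

With `alpha=1.0`, every distinct text contributes a total weight of 1 no matter how often it is repeated, so the duplicated sample no longer dominates the average, yet no sample is deleted. Setting `alpha` between 0 and 1 would down-weight duplicates less aggressively.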

Videos

FedRW: Efficient Privacy-Preserving Data Reweighting for Enhancing Federated Learning of Language Models · SlidesLive

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education