FedRW: Efficient Privacy-Preserving Data Reweighting for Enhancing Federated Learning of Language Models
Pukang Ye, Junwei Luo, Xiaolei Dong, Yunbo Yang

TL;DR
FedRW introduces a privacy-preserving, reweighting-based approach to handle data duplication in federated language model training, improving efficiency, generalization, and privacy without relying on trusted third parties.
Contribution
It is the first framework to perform soft deduplication via sample reweighting in federated LLM training without trusted third parties.
Findings
Achieves up to 28.78x speedup in preprocessing
Improves perplexity by approximately 11.42%
Provides enhanced security guarantees
Abstract
Data duplication within large-scale corpora often impedes large language models' (LLMs) performance and privacy. In privacy-concerned federated learning scenarios, conventional deduplication methods typically rely on trusted third parties to perform uniform deletion, risking loss of informative samples while introducing privacy vulnerabilities. To address these gaps, we propose Federated ReWeighting (FedRW), the first privacy-preserving framework, to the best of our knowledge, that performs soft deduplication via sample reweighting instead of deletion in federated LLM training, without assuming a trusted third party. At its core, FedRW proposes a secure, frequency-aware reweighting protocol through secure multi-party computation, coupled with a parallel orchestration strategy to ensure efficiency and scalability. During training, FedRW utilizes an adaptive reweighting mechanism with…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsPrivacy-Preserving Technologies in Data · Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education
