Privacy-Preserving Data Deduplication for Enhancing Federated Learning of Language Models (Extended Version)

Aydin Abadi; Vishnu Asutosh Dasu; Sumanta Sarkar

arXiv:2407.08152·cs.CR·December 5, 2024

Open Access 1 Repo

TL;DR

This paper introduces EP-MPD, a privacy-preserving deduplication protocol for federated learning of language models, significantly improving efficiency and model performance while maintaining data privacy.

Contribution

The paper presents a novel, modular protocol for privacy-preserving deduplication in federated learning using two new Private Set Intersection variants.

Findings

01. Up to 19.62% improvement in perplexity

02. Up to 27.95% reduction in running time

03. Effective privacy-performance balance in large-scale federated learning
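
The first finding is stated in terms of perplexity, where lower is better. As a minimal sketch, assuming perplexity is computed as the exponential of the mean negative log-likelihood per token, and assuming the reported percentage is a relative reduction against a no-deduplication baseline (this page does not say how it was computed), the two quantities could be expressed as follows. The function names and inputs are illustrative and do not come from the paper's repository.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities: exp(-mean log p)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_improvement(baseline_ppl: float, dedup_ppl: float) -> float:
    """Percentage reduction in perplexity relative to the baseline (assumed metric)."""
    return 100.0 * (baseline_ppl - dedup_ppl) / baseline_ppl
```

Under this reading, a model whose perplexity drops from 10.0 to 8.0 after deduplication would show a 20% improvement.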

Abstract

Deduplication is a vital preprocessing step that enhances machine learning model performance and saves training time and energy. However, enhancing federated learning through deduplication poses challenges, especially regarding scalability and potential privacy violations if deduplication involves sharing all clients' data. In this paper, we address the problem of deduplication in a federated setup by introducing a pioneering protocol, Efficient Privacy-Preserving Multi-Party Deduplication (EP-MPD). It efficiently removes duplicates from multiple clients' datasets without compromising data privacy. EP-MPD is constructed in a modular fashion, utilizing two novel variants of the Private Set Intersection protocol. Our extensive experiments demonstrate the significant benefits of deduplication in federated learning of large language models. For instance, we observe up to 19.62% improvement…
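
To make the goal concrete, the sketch below shows the plaintext functionality that multi-party deduplication targets: after it runs, each distinct sample survives in only one client's dataset. This is explicitly not EP-MPD and offers no privacy, since a single party sees every client's data, which is exactly the sharing the abstract warns against. The "first client in sorted order keeps the copy" rule, the function name, and the client labels are assumptions made for illustration; the abstract does not specify which client retains a duplicate.

```python
from typing import Dict, List


def naive_multi_party_dedup(datasets: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Non-private baseline: keep each distinct sample in exactly one dataset.

    Clients are visited in sorted order of their identifier, and the first
    client holding a sample keeps it (an assumed tie-breaking rule).
    Duplicates inside a single client's dataset are dropped as well.
    """
    seen = set()
    deduplicated: Dict[str, List[str]] = {}
    for client in sorted(datasets):
        kept = []
        for sample in datasets[client]:
            if sample not in seen:
                seen.add(sample)
                kept.append(sample)
        deduplicated[client] = kept
    return deduplicated


# Hypothetical toy datasets for three clients.
clients = {
    "a": ["x", "y", "x"],
    "b": ["y", "z"],
    "c": ["z", "w"],
}
deduped = naive_multi_party_dedup(clients)
```

Here client `a` keeps `x` and `y`, client `b` keeps only `z`, and client `c` keeps only `w`. According to the abstract, EP-MPD removes duplicates across clients without compromising data privacy, building on two Private Set Intersection variants rather than on a party that sees all the data.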

Code & Models

Repositories

vdasu/deduplication (PyTorch, official)

Taxonomy

Topics: Privacy-Preserving Technologies in Data

Methods: Sparse Evolutionary Training