Privacy-Preserving Dataset Combination

Keren Fuentes; Mimee Xu; Irene Chen

arXiv:2502.05765·cs.LG·October 20, 2025

TL;DR

This paper introduces SecureKL, a privacy-preserving protocol that evaluates dataset utility through private computation without privacy leakage, enabling secure data-sharing decisions in sensitive domains such as healthcare.

Contribution

The paper presents SecureKL, the first secure protocol for dataset-to-dataset evaluation with zero privacy leakage, facilitating privacy-aware data sharing decisions.

Findings

01

Achieves over 90% correlation with non-private evaluations.

02

Effectively identifies beneficial data collaborations in heterogeneous domains.

03

Outperforms privacy-agnostic utility assessments that leak information.

Abstract

Access to diverse, high-quality datasets is crucial for machine learning model performance, yet data sharing remains limited by privacy concerns and competitive interests, particularly in regulated domains like healthcare. This dynamic especially disadvantages smaller organizations that lack resources to purchase data or negotiate favorable sharing agreements, due to the inability to privately assess external data's utility. To resolve privacy and uncertainty tensions simultaneously, we introduce SecureKL, the first secure protocol for dataset-to-dataset evaluations with zero privacy leakage, designed to be applied preceding data sharing. SecureKL evaluates a source dataset against candidates, performing dataset divergence metrics internally with private computations, all without assuming downstream models. On real-world data, SecureKL achieves high consistency…
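
To make "dataset divergence metrics" concrete, here is a minimal plaintext sketch of scoring candidate datasets against a source dataset with a KL divergence over discrete histograms. It is an illustration only, not the paper's protocol: it performs no private computation at all, and the choice of KL (suggested only by the name SecureKL), the Laplace smoothing, the toy data, and the function names histogram, kl_divergence, and rank_candidates are all assumptions introduced here.

```python
import math
from collections import Counter


def histogram(values, bins, smoothing=1.0):
    """Laplace-smoothed empirical distribution of `values` over `bins`.

    Smoothing keeps every probability positive so the KL ratio is defined.
    """
    counts = Counter(values)
    total = len(values) + smoothing * len(bins)
    return [(counts.get(b, 0) + smoothing) / total for b in bins]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def rank_candidates(source, candidates, bins):
    """Score each candidate by KL(source || candidate), lowest divergence first."""
    p = histogram(source, bins)
    scores = {
        name: kl_divergence(p, histogram(data, bins))
        for name, data in candidates.items()
    }
    return sorted(scores, key=scores.get), scores


# Hypothetical toy data: one categorical feature with three possible values.
bins = [0, 1, 2]
source = [0, 0, 1, 1, 2]
candidates = {
    "a": [0, 0, 1, 1, 2],  # same distribution as the source
    "b": [2, 2, 2, 2, 2],  # concentrated on a single value
}

order, scores = rank_candidates(source, candidates, bins)
print(order, scores)
```

In this plaintext version both parties' data sit in one process, which is exactly the exposure a secure protocol is meant to avoid. How a divergence score should translate into a sharing decision is not specified in the abstract, so the sketch only sorts candidates by score.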

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Big Data Technologies and Applications · Data Quality and Management