Data Plagiarism Index: Characterizing the Privacy Risk of Data-Copying in Tabular Generative Models
Joshua Ward, Chi-Hua Wang, Guang Cheng

TL;DR
This paper introduces the Data Plagiarism Index (DPI), a new metric for assessing privacy risks in tabular generative models by measuring data-copying and its implications for privacy and fairness.
Contribution
The paper proposes DPI, a novel similarity metric and membership inference attack tailored for high-dimensional tabular data, addressing limitations of existing methods.
Findings
DPI effectively measures data-copying in tabular models.
Data-copying identified by DPI poses privacy and fairness risks.
Current models require more sophisticated techniques to mitigate data-copying.
Abstract
The promise of tabular generative models is to produce realistic synthetic data that can be shared and used safely, without dangerous leakage of information from the training set. To evaluate these models, a variety of methods have been proposed to measure their tendency to copy data from the training dataset when generating a sample. However, these methods do not consider data-copying from a privacy threat perspective, are not motivated by recent results in the data-copying literature, or are difficult to make compatible with the high-dimensional, mixed-type nature of tabular data. This paper proposes a new similarity metric and Membership Inference Attack for tabular data called the Data Plagiarism Index (DPI). We show that DPI evaluates a new intuitive definition of data-copying and characterizes the corresponding privacy risk. We show that the data-copying…
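The abstract excerpt here does not give DPI's formula, so the following is not DPI. It is a minimal sketch of a generic distance-based data-copying check of the kind the abstract contrasts with, assuming a Gower-style distance for mixed-type rows and a holdout set drawn from the same distribution as the training set. The function names (`gower_distances`, `closer_to_train_share`) and the toy data are hypothetical.

```python
import numpy as np

def gower_distances(a_num, a_cat, b_num, b_cat, num_range):
    """Pairwise Gower-style distances between rows of A and rows of B."""
    # Numeric part: absolute difference scaled by each column's range.
    num = np.abs(a_num[:, None, :] - b_num[None, :, :]) / num_range
    # Categorical part: 1 where the categories differ, 0 where they match.
    cat = (a_cat[:, None, :] != b_cat[None, :, :]).astype(float)
    # Average the per-column dissimilarities over all columns.
    return np.concatenate([num, cat], axis=2).mean(axis=2)

def closer_to_train_share(syn, train, holdout, num_range):
    """Share of synthetic rows whose nearest training row is strictly
    closer than their nearest holdout row (a generic heuristic, not DPI)."""
    d_train = gower_distances(*syn, *train, num_range).min(axis=1)
    d_hold = gower_distances(*syn, *holdout, num_range).min(axis=1)
    return float(np.mean(d_train < d_hold))

# Toy data: each dataset is a (numeric_columns, categorical_columns) pair.
train = (np.array([[0.0], [10.0]]), np.array([["a"], ["b"]]))
holdout = (np.array([[5.0], [7.0]]), np.array([["a"], ["b"]]))
syn = (np.array([[0.0], [6.0]]), np.array([["a"], ["b"]]))
num_range = np.array([10.0])  # assumed known range of the numeric column

share = closer_to_train_share(syn, train, holdout, num_range)
print(share)
```

Under these assumptions, with training and holdout sets of equal size, a share well above 0.5 would suggest that synthetic rows sit unusually close to training rows, while a share near 0.5 would not. This heuristic does not by itself frame copying as a privacy attack, which is the gap the abstract says DPI is meant to address.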
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Ethics and Social Impacts of AI
