Measuring Privacy vs. Fidelity in Synthetic Social Media Datasets
Henry Tari, Adriana Iamnitchi

TL;DR
This paper evaluates the privacy risks and data fidelity of synthetic social media posts generated by large language models, proposing a framework to quantify privacy leakage and analyze the privacy-fidelity trade-off.
Contribution
It introduces a novel methodology for assessing privacy in synthetic text data using authorship attribution attacks and examines the privacy-fidelity balance in social media datasets.
Findings
Synthetic posts show reduced re-identification risk compared to real data.
Higher fidelity in synthetic data correlates with increased privacy leakage.
The proposed framework effectively quantifies privacy risks in synthetic social media text.
Abstract
Synthetic data is increasingly used to support research without exposing sensitive user content. Social media datasets in particular would benefit from representative synthetic equivalents that can be used to bootstrap research and enable reproducibility through data sharing. However, recent studies show that (tabular) synthetic data is not inherently privacy-preserving, and much less is known about the privacy risks of synthetically generated unstructured text. This work evaluates the privacy of synthetic Instagram posts generated by three state-of-the-art large language models using two prompting strategies. We propose a methodology that quantifies privacy by framing re-identification as an authorship attribution attack. A RoBERTa-large classifier trained on real posts achieved 81% accuracy in authorship attribution on real data, but only 16.5–29.7%…
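The attack framing in the abstract can be illustrated with a minimal sketch: train an authorship classifier on real posts, then compare its attribution accuracy on held-out real posts with its accuracy on synthetic posts generated for the same authors. Everything below is an assumption made for illustration. The paper uses a RoBERTa-large classifier, whereas this sketch swaps in a toy bag-of-words cosine-similarity classifier so that it runs with the standard library alone. The author names, the posts, and the helper names (`featurize`, `train`, `attribute`, `attribution_accuracy`) are all hypothetical and do not come from the paper.

```python
from collections import Counter
import math

def featurize(text):
    """Lowercased word counts; a toy stand-in for a learned text encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def train(posts):
    """Build one aggregated word-count profile per author from (author, text) pairs."""
    profiles = {}
    for author, text in posts:
        profiles.setdefault(author, Counter()).update(featurize(text))
    return profiles

def attribute(profiles, text):
    """Attribute a post to the author whose profile is most similar."""
    feats = featurize(text)
    return max(sorted(profiles), key=lambda a: cosine(profiles[a], feats))

def attribution_accuracy(profiles, posts):
    """Fraction of posts attributed to their true (or source) author."""
    hits = sum(attribute(profiles, text) == author for author, text in posts)
    return hits / len(posts)

# Hypothetical toy data, invented purely for illustration.
real_train = [
    ("alice", "sunrise hike with coffee and my dog"),
    ("alice", "coffee first then a long hike"),
    ("bob", "new guitar riff recorded in the garage"),
    ("bob", "garage band practice guitar too loud"),
]
real_test = [
    ("alice", "my dog loves a morning hike"),
    ("bob", "guitar strings snapped at practice"),
]
# Synthetic posts are labeled with the author whose real posts seeded them.
synthetic = [
    ("alice", "relaxing in the garden today"),
    ("bob", "guitar session this evening"),
]

profiles = train(real_train)
acc_real = attribution_accuracy(profiles, real_test)
acc_synth = attribution_accuracy(profiles, synthetic)
print(f"real: {acc_real:.2f}  synthetic: {acc_synth:.2f}")
```

Under this framing, a lower attribution accuracy on synthetic posts than on real posts is read as lower re-identification risk against this particular attacker. It says nothing about stronger attackers or other attack types.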
Taxonomy
Topics: Authorship Attribution and Profiling · Hate Speech and Cyberbullying Detection · Privacy-Preserving Technologies in Data
