A Synthetic Dataset for Personal Attribute Inference
Hanna Yukhymenko, Robin Staab, Mark Vero, Martin Vechev

TL;DR
This paper introduces SynthPAI, an LLM-generated synthetic dataset for studying the privacy risks of personal attribute inference from online text without relying on real personal data.
Contribution
The authors build a simulation framework for Reddit, driven by LLM agents seeded with synthetic personal profiles, and use it to generate a novel synthetic dataset that enables privacy-preserving research on personal attribute inference.
Findings
Humans struggle to distinguish synthetic comments from real ones.
Synthetic data enables consistent inference results across multiple LLMs.
The dataset supports privacy research without compromising real user data.
Abstract
Recently, powerful Large Language Models (LLMs) have become easily accessible to hundreds of millions of users world-wide. However, their strong capabilities and vast world knowledge do not come without associated privacy risks. In this work, we focus on the emerging privacy threat LLMs pose -- the ability to accurately infer personal information from online texts. Despite the growing importance of LLM-based author profiling, research in this area has been hampered by a lack of suitable public datasets, largely due to ethical and privacy concerns associated with real personal data. We take two steps to address this problem: (i) we construct a simulation framework for the popular social media platform Reddit using LLM agents seeded with synthetic personal profiles; (ii) using this framework, we generate SynthPAI, a diverse synthetic dataset of over 7800 comments manually labeled for…
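The first step described in the abstract, agents seeded with synthetic profiles writing comments in a shared thread, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `Profile` and `Comment` classes, the attribute names, and `stub_agent_reply` (a placeholder standing in for a real LLM call) are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Profile:
    """Hypothetical synthetic profile; attribute names are illustrative only."""
    username: str
    attributes: dict


@dataclass
class Comment:
    author: str
    text: str


def stub_agent_reply(profile, thread, rng):
    """Placeholder for an LLM call that would write a comment in the profile's voice.

    Here it only records which (hypothetical) attribute would shape the reply.
    """
    key = rng.choice(sorted(profile.attributes))
    return Comment(profile.username, f"[reply {len(thread) + 1}, shaped by {key}]")


def simulate_thread(profiles, n_turns, seed=0):
    """Let randomly chosen agents take turns appending comments to one thread."""
    rng = random.Random(seed)
    thread = []
    for _ in range(n_turns):
        agent = rng.choice(profiles)
        thread.append(stub_agent_reply(agent, thread, rng))
    return thread
```

In a real pipeline the stub would be replaced by a model call that receives the profile and the thread so far. Because every profile is synthetic, the resulting comments can be labeled and shared without exposing real users.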
Taxonomy
Topics: Big Data Technologies and Applications · Recommender Systems and Techniques · Data Quality and Management
Methods: Chain-of-thought prompting · Focus
