A Synthetic Dataset for Personal Attribute Inference

Hanna Yukhymenko; Robin Staab; Mark Vero; Martin Vechev

arXiv:2406.07217·cs.LG·November 5, 2024

TL;DR

This paper introduces SynthPAI, an LLM-generated synthetic dataset for studying the privacy risks of personal attribute inference from online texts without relying on real personal data.

Contribution

The authors build a simulation framework for Reddit in which LLM agents are seeded with synthetic personal profiles, and use it to generate a synthetic dataset that enables privacy-preserving research on LLM-based personal attribute inference.
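To make the idea of a profile-seeded simulation concrete, here is a minimal sketch in Python. It is not the authors' implementation: the `Profile` fields, the `stub_generate` function (a canned stand-in for an LLM call), and the random reply policy in `simulate_thread` are all hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Profile:
    """Hypothetical synthetic profile; the field names are illustrative only."""
    username: str
    age: int
    occupation: str
    city: str


@dataclass
class Comment:
    author: str
    text: str
    parent: Optional[int]  # index of the comment replied to, None for top level


def stub_generate(profile: Profile, context: str) -> str:
    """Stand-in for an LLM call (assumption): returns a canned reply that
    indirectly hints at attributes from the agent's profile."""
    return (f"Re '{context[:30]}': after a shift as a {profile.occupation} "
            f"in {profile.city}, I'd say it depends.")


def simulate_thread(profiles: List[Profile], topic: str, rounds: int,
                    rng: random.Random) -> List[Comment]:
    """Build a comment thread: each round, a random agent replies either to
    the topic (first comment) or to a randomly chosen earlier comment."""
    thread: List[Comment] = []
    for _ in range(rounds):
        agent = rng.choice(profiles)
        if thread:
            parent = rng.randrange(len(thread))
            context = thread[parent].text
        else:
            parent = None
            context = topic
        thread.append(Comment(agent.username, stub_generate(agent, context), parent))
    return thread


if __name__ == "__main__":
    demo_profiles = [
        Profile("user_a", 34, "nurse", "Zurich"),
        Profile("user_b", 27, "welder", "Leeds"),
    ]
    for c in simulate_thread(demo_profiles, "Best commute snacks?", 4, random.Random(0)):
        print(c.parent, c.author, c.text)
```

Because each comment is produced from a known synthetic profile, the true attributes behind every text are available by construction, which is what lets such data stand in for real user data in inference experiments.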

Findings

1. Humans struggle to distinguish synthetic comments from real ones.
2. Synthetic data enables consistent inference results across multiple LLMs.
3. The dataset supports privacy research without compromising real user data.

Abstract

Recently, powerful Large Language Models (LLMs) have become easily accessible to hundreds of millions of users world-wide. However, their strong capabilities and vast world knowledge do not come without associated privacy risks. In this work, we focus on the emerging privacy threat LLMs pose -- the ability to accurately infer personal information from online texts. Despite the growing importance of LLM-based author profiling, research in this area has been hampered by a lack of suitable public datasets, largely due to ethical and privacy concerns associated with real personal data. We take two steps to address this problem: (i) we construct a simulation framework for the popular social media platform Reddit using LLM agents seeded with synthetic personal profiles; (ii) using this framework, we generate SynthPAI, a diverse synthetic dataset of over 7800 comments manually labeled for…

Code & Models

Datasets

RobinSta/SynthPAI (dataset, 201 downloads)

Videos

A Synthetic Dataset for Personal Attribute Inference · SlidesLive

Taxonomy

Topics: Big Data Technologies and Applications · Recommender Systems and Techniques · Data Quality and Management

Methods: Chain-of-thought prompting · Focus