Protecting Vulnerable Voices: Synthetic Dataset Generation for Self-Disclosure Detection

Shalini Jangra; Suparna De; Nishanth Sastry; Saeed Fadaei

arXiv:2507.22930·cs.CL·August 1, 2025

Open Access

TL;DR

This paper introduces a methodology for generating synthetic datasets that mimic PII-revealing posts on social media, enabling privacy-preserving research into self-disclosure detection using large language models.

Contribution

It presents a novel approach to create and validate synthetic PII-labeled datasets from LLMs, addressing data sharing and privacy concerns in self-disclosure research.

Findings

01. Synthetic data closely resembles original posts in training models.

02. Synthetic data is unlinkable to original users.

03. Synthetic data is indistinguishable from real posts to humans.

Abstract

Social platforms such as Reddit have a network of communities of shared interests, with a prevalence of posts and comments from which one can infer users' Personal Information Identifiers (PIIs). While such self-disclosures can lead to rewarding social interactions, they pose privacy risks and the threat of online harms. Research into the identification and retrieval of such risky self-disclosures of PIIs is hampered by the lack of open-source labeled datasets. To foster reproducible research into PII-revealing text detection, we develop a novel methodology to create synthetic equivalents of PII-revealing data that can be safely shared. Our contributions include creating a taxonomy of 19 PII-revealing categories for vulnerable populations and the creation and release of a synthetic PII-labeled multi-text span dataset generated from 3 text generation Large Language Models (LLMs),…
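
To make "PII-labeled multi-text span dataset" concrete, the following is a minimal sketch of how one record in such a dataset could be represented: a post plus several character-offset spans, each tagged with a category. Everything here is an illustrative assumption rather than the released dataset's schema: the class names `PiiSpan` and `LabeledPost`, the helper `label_post`, the category names, and the example sentence are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PiiSpan:
    start: int      # inclusive character offset into the post text
    end: int        # exclusive character offset into the post text
    category: str   # hypothetical category label, not the paper's taxonomy


@dataclass(frozen=True)
class LabeledPost:
    text: str
    spans: Tuple[PiiSpan, ...]

    def extract(self) -> List[Tuple[str, str]]:
        """Return (category, surface text) pairs for every labeled span."""
        return [(s.category, self.text[s.start:s.end]) for s in self.spans]


def label_post(text: str, annotations: List[Tuple[str, str]]) -> LabeledPost:
    """Build a LabeledPost from (phrase, category) pairs.

    Each phrase is located by its first occurrence in the text; a phrase
    that does not occur raises ValueError so bad labels fail loudly.
    """
    spans = []
    for phrase, category in annotations:
        start = text.find(phrase)
        if start < 0:
            raise ValueError(f"phrase not found in text: {phrase!r}")
        spans.append(PiiSpan(start, start + len(phrase), category))
    return LabeledPost(text, tuple(spans))


# Made-up example post with three hypothetical categories.
post = label_post(
    "I'm 34 and just moved to Leeds after my divorce.",
    [("34", "AGE"), ("Leeds", "LOCATION"), ("divorce", "RELATIONSHIP_STATUS")],
)
pairs = post.extract()
```

Storing offsets rather than only the matched strings lets a single post carry several labeled spans and keeps each label tied to an exact position in the text, which is what a span-level detector would be trained to predict.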

Taxonomy

Topics: Privacy, Security, and Data Protection · Hate Speech and Cyberbullying Detection · Mental Health via Writing