Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs

Tevin Atwal; Chan Nam Tieu; Yefeng Yuan; Zhan Shi; Yuhong Liu; Liang Cheng

arXiv:2507.18055 · cs.CL · July 25, 2025

Open Access

TL;DR

This paper evaluates the diversity and privacy of synthetic reviews generated by LLMs, identifies their limitations, and proposes a prompt-based method to improve diversity while maintaining privacy.

Contribution

The paper introduces a comprehensive set of metrics for assessing the diversity and privacy of synthetic reviews, and proposes a prompt-based approach to enhance diversity without compromising privacy.

Findings

01. LLMs have significant limitations in generating diverse synthetic reviews.

02. Current synthetic data poses privacy risks such as re-identification.

03. Prompt-based techniques can improve diversity while preserving privacy.

Abstract

The increasing use of synthetic data generated by Large Language Models (LLMs) presents both opportunities and challenges in data-driven applications. While synthetic data provides a cost-effective, scalable alternative to real-world data to facilitate model training, its diversity and privacy risks remain underexplored. Focusing on text-based synthetic data, we propose a comprehensive set of metrics to quantitatively assess the diversity (i.e., linguistic expression, sentiment, and user perspective) and privacy (i.e., re-identification risk and stylistic outliers) of synthetic datasets generated by several state-of-the-art LLMs. Experimental results reveal significant limitations in LLMs' capabilities in generating diverse and privacy-preserving synthetic data. Guided by the evaluation results, a prompt-based approach is proposed to enhance the diversity of synthetic reviews while…
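To make the idea of quantifying diversity of linguistic expression concrete, here is a minimal sketch of distinct-n, a widely used lexical diversity score (the ratio of unique n-grams to total n-grams in a corpus). This is an illustrative assumption: the abstract does not say which metrics the paper uses, and the function name `distinct_n` and the whitespace tokenization are hypothetical choices made for this example.

```python
def distinct_n(texts, n=2):
    """Return the ratio of unique n-grams to total n-grams across texts.

    Illustrative lexical diversity score only; not confirmed to be one of
    the paper's metrics. Tokenization is naive lowercase whitespace splitting.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        # Slide a window of size n over the token list.
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    # An empty corpus (or texts shorter than n tokens) has no n-grams.
    return len(unique) / total if total else 0.0


if __name__ == "__main__":
    repetitive = ["great product fast shipping", "great product fast shipping"]
    varied = ["great product fast shipping", "arrived late but works fine"]
    print(distinct_n(repetitive, n=2))
    print(distinct_n(varied, n=2))
```

A score near 1.0 suggests that reviews rarely share phrasing, while a lower score suggests repeated wording across reviews. A score like this only captures surface-level variety; sentiment and user perspective would need separate measures.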

Taxonomy

Topics: Privacy-Preserving Technologies in Data