PANORAMA: A synthetic PII-laced dataset for studying sensitive data memorization in LLMs

Sriram Selvam; Anneswa Ghosh

arXiv:2505.12238·cs.CL·May 20, 2025

TL;DR

PANORAMA is a large-scale synthetic dataset designed to emulate real-world sensitive data, enabling the study of PII memorization in large language models to improve privacy safeguards.

Contribution

We created a realistic, diverse synthetic PII dataset for studying memorization in LLMs, addressing the lack of comprehensive datasets for privacy risk analysis.

Findings

01 Memorization increases with data repetition.

02 Content type influences memorization risk.

03 PANORAMA enables effective privacy risk assessment.
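
The repetition finding can be made concrete with a small measurement sketch. This is not the authors' evaluation code: the function name `extraction_rate_by_repetition`, the record layout, and the toy strings are all assumptions made for illustration, and verbatim substring matching is a deliberate simplification of how memorization might be detected.

```python
from collections import defaultdict


def extraction_rate_by_repetition(records):
    """Group records by repetition count and compute a verbatim hit rate.

    records: iterable of (repetitions, pii_value, model_output) tuples, where
    `repetitions` is how many times the PII value appeared in training data
    (hypothetical layout, not taken from the paper).
    Returns {repetitions: fraction of records whose pii_value appears
    verbatim in model_output}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for repetitions, pii_value, model_output in records:
        totals[repetitions] += 1
        if pii_value in model_output:
            hits[repetitions] += 1
    return {r: hits[r] / totals[r] for r in sorted(totals)}


# Made-up toy inputs, purely to exercise the function; not results.
toy_records = [
    (1, "555-0100", "I do not know that number."),
    (1, "555-0101", "The number is 555-0101."),
    (5, "555-0102", "Call 555-0102 for details."),
    (5, "555-0103", "Reach them at 555-0103."),
]
rates = extraction_rate_by_repetition(toy_records)
print(rates)
```

In a real study the records would come from prompting a fine-tuned model with prefixes drawn from the dataset, and a looser matching rule may be more appropriate than exact substring search.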

Abstract

The memorization of sensitive and personally identifiable information (PII) by large language models (LLMs) poses growing privacy risks as models scale and are increasingly deployed in real-world applications. Existing efforts to study sensitive and PII data memorization and develop mitigation strategies are hampered by the absence of comprehensive, realistic, and ethically sourced datasets reflecting the diversity of sensitive information found on the web. We introduce PANORAMA (Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis), a large-scale synthetic corpus of 384,789 samples derived from 9,674 synthetic profiles, designed to closely emulate the distribution, variety, and context of PII and sensitive data as they naturally occur in online environments. Our data generation pipeline begins with the construction of internally consistent,…

Code & Models

Repositories

selvamsriram/panorama-datagen (Official)

Taxonomy

Topics: Privacy, Security, and Data Protection · Mental Health via Writing · Authorship Attribution and Profiling