PANORAMA: A synthetic PII-laced dataset for studying sensitive data memorization in LLMs
Sriram Selvam, Anneswa Ghosh

TL;DR
PANORAMA is a large-scale synthetic dataset designed to emulate real-world sensitive data, enabling the study of PII memorization in large language models to improve privacy safeguards.
Contribution
We created a realistic, diverse synthetic PII dataset for studying memorization in LLMs, addressing the lack of comprehensive datasets for privacy risk analysis.
Findings
Memorization increases with data repetition.
Content type influences memorization risk.
PANORAMA enables effective privacy risk assessment.
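As an illustration of the first finding, the sketch below shows one way memorization could be tallied against repetition count: check whether a PII string reappears verbatim in a model completion, then compute the leak rate per repetition level. This is a minimal sketch under stated assumptions, not the authors' evaluation protocol. The helper names (`pii_leaked`, `leak_rate_by_repetition`), the verbatim-match criterion, and the toy trials are all hypothetical and are not taken from PANORAMA.

```python
from collections import defaultdict


def pii_leaked(completion: str, pii_value: str) -> bool:
    """Return True if the PII string appears verbatim (case-insensitive) in the completion."""
    return pii_value.lower() in completion.lower()


def leak_rate_by_repetition(trials):
    """Compute the fraction of trials that reproduce the PII, per repetition count.

    trials: iterable of (repetitions, completion, pii_value) tuples, where
    `repetitions` is how often the sample appeared in the training data.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for repetitions, completion, pii_value in trials:
        totals[repetitions] += 1
        if pii_leaked(completion, pii_value):
            hits[repetitions] += 1
    return {r: hits[r] / totals[r] for r in sorted(totals)}


# Toy placeholder trials, invented for illustration only
# (not PANORAMA data and not results from the paper).
toy_trials = [
    (1, "I am not sure what the email is.", "jane.doe@example.com"),
    (1, "Contact: jane.doe@example.com", "jane.doe@example.com"),
    (5, "Her email is Jane.Doe@example.com.", "jane.doe@example.com"),
    (5, "Reach her at jane.doe@example.com", "jane.doe@example.com"),
]

rates = leak_rate_by_repetition(toy_trials)
print(rates)
```

Verbatim matching is the simplest possible criterion; a real study would likely also need to handle partial or paraphrased leakage, which this sketch does not attempt.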
Abstract
The memorization of sensitive and personally identifiable information (PII) by large language models (LLMs) poses growing privacy risks as models scale and are increasingly deployed in real-world applications. Efforts to study the memorization of sensitive data and PII, and to develop mitigation strategies, are hampered by the absence of comprehensive, realistic, and ethically sourced datasets that reflect the diversity of sensitive information found on the web. We introduce PANORAMA (Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis), a large-scale synthetic corpus of 384,789 samples derived from 9,674 synthetic profiles, designed to closely emulate the distribution, variety, and context of PII and sensitive data as they naturally occur in online environments. Our data generation pipeline begins with the construction of internally consistent,…
Taxonomy
Topics: Privacy, Security, and Data Protection · Mental Health via Writing · Authorship Attribution and Profiling
