Privasis: Synthesizing the Largest "Public" Private Dataset from Scratch
Hyunwoo Kim, Niloofar Mireshghallah, Michael Duan, Rui Xin, Shuyue Stella Li, Jaehun Jung, David Acuna, Qi Pang, Hanshen Xiao, G. Edward Suh, Sewoong Oh, Yulia Tsvetkov, Pang Wei Koh, Yejin Choi

TL;DR
Privasis is a large-scale synthetic dataset with rich private information, designed to facilitate research in privacy-sensitive AI applications, outperforming existing datasets and models in quality and diversity.
Contribution
We introduce Privasis, the largest fully synthetic private dataset built from scratch, enabling advances in privacy-sensitive AI research and privacy-preserving text sanitization.
Findings
Privasis contains 1.4 million records with 55.1 million annotated attributes.
Sanitization models trained on Privasis outperform GPT-5 and Qwen-3 235B.
Privasis accelerates research in privacy-sensitive domains.
Abstract
Research involving privacy-sensitive data has always been constrained by data scarcity, standing in sharp contrast to other areas that have benefited from data scaling. This challenge is becoming increasingly urgent as modern AI agents--such as OpenClaw and Gemini Agent--are granted persistent access to highly sensitive personal information. To tackle this longstanding bottleneck and the rising risks, we present Privasis (i.e., privacy oasis), the first million-scale fully synthetic dataset entirely built from scratch--an expansive reservoir of texts with rich and diverse private information--designed to broaden and accelerate research in areas where processing sensitive social data is inevitable. Compared to existing datasets, Privasis, comprising 1.4 million records, offers orders-of-magnitude larger scale with quality, and far greater diversity across various document types,…
Peer Reviews
Decision·Submitted to ICLR 2026
The paper introduces a million-scale, synthetic, privacy-rich corpus spanning many domains, with fine-grained, multi-level sanitisation targets and instructions. The paper goes beyond prior PII-span datasets by supporting removal and graded abstraction across long records and arbitrary categories. The method is strong. It uses controlled generation with rich attribute annotation and grouping, and a rigorous hierarchical evaluation that detects direct, inference, and proximity leaks while requir
The paper trains only on PRIVASIS-Sanitization and evaluates on its own test sets, without comparing the same model trained on other datasets (e.g. NAP^2 and other datasets mentioned in this paper). This makes it unclear whether PRIVASIS’s advantage comes from its design or just from data scale and domain match. For LLMs, smaller, high-quality datasets can in many cases yield similar or better results, as indicated in LIMA: Less Is More for Alignment. Profiles are seeded from the US Social Securit
- Large and open source: As the authors point out, there are few large-scale open-source datasets for privacy tasks (there are some datasets developed with *proprietary* methods that are on the scale of 100k observations by AI4Privacy and the dataset of Selvam and Ghosh consists of 384,789 observations). PRIVASIS, at 4x the size of Selvam and Ghosh, arguably fills a need for large-scale open data.
- Sampling Diversity: The authors build on existing literature which seeds synthetic data generatio
## Privacy Safety
- A central claim is that PRIVASIS is "privacy-safe" because it is conditioned only on public name databases. But, ultimately LLMs are going to output content that is comparatively likely to co-occur with the provided context. Given that the dataset is generated by sampling from LLMs, the extent to which the dataset is truly privacy-safe is going to be a function of the safety of the underlying models.
- The current test of safety relies on sampling 100 profiles and querying (p
* The auxiliary control variable approach combined with Vendi score-based diversity preservation is elegant and well-motivated.
* With 1.2M records, 44M attributes, and coverage across 10 primary domain categories, PRIVASIS represents an orders-of-magnitude leap over existing privacy datasets. The hierarchical categorization into 42 subcategories and support for contextual, instruction-based sanitization beyond fixed PII categories addresses real limitations in current privacy-preserving approa
* The paper omits several important text-to-text privatization baselines [1,2,3]. It would be beneficial to discuss these works and, if feasible, include them as additional baselines for comparison.
* It would also strengthen the paper if the authors considered multiple threat models, such as the Static and Adaptive Attacker settings described in [4].
* The pipeline relies heavily on specific LLM capabilities (GPT-OSS-120B for 62.6% of records) without thoroughly investigating how synthesis qua
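The reviews refer to Vendi score-based diversity preservation without spelling out the metric. As a rough illustration only, and not the authors' pipeline: the Vendi score is commonly defined as the exponential of the Shannon entropy of the eigenvalues of K/n, where K is an n-by-n similarity matrix with ones on the diagonal. The cosine kernel, the function name `vendi_score`, and the toy embeddings below are assumptions made for this sketch.

```python
import numpy as np


def vendi_score(embeddings: np.ndarray) -> float:
    """Vendi score of a set of items, using a cosine-similarity kernel.

    Assumes `embeddings` has shape (n, d) and no all-zero rows.
    The kernel choice is an assumption of this sketch.
    """
    # Normalise rows so the similarity matrix has ones on its diagonal.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    n = unit.shape[0]

    # Cosine similarity matrix K, scaled so its eigenvalues sum to 1.
    scaled_kernel = (unit @ unit.T) / n

    # Eigenvalues of a symmetric matrix; drop numerical zeros and
    # tiny negatives so that log() is well defined.
    eigvals = np.linalg.eigvalsh(scaled_kernel)
    eigvals = eigvals[eigvals > 1e-12]

    # Exponential of the Shannon entropy of the eigenvalue distribution.
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


# Four identical items behave like one effective item (score close to 1.0).
identical = np.ones((4, 3))
# Four mutually orthogonal items count as four effective items (close to 4.0).
orthogonal = np.eye(4)

score_identical = vendi_score(identical)
score_orthogonal = vendi_score(orthogonal)
```

Read this way, the score acts as an "effective number of distinct items": a generation step that keeps producing near-duplicates pushes it toward 1, while a spread-out sample pushes it toward n.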
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Topic Modeling · Authorship Attribution and Profiling
