EPSVec: Efficient and Private Synthetic Data Generation via Dataset Vectors
Amin Banayeeanzade, Qingchuan Yang, Deqing Fu, Spencer Hong, Erin Babinsky, Alfy Samuel, Anoop Kumar, Robin Jia, Sai Praneeth Karimireddy

TL;DR
EPSVec is a differentially private method that generates high-quality synthetic data efficiently by using dataset vectors to steer large language models, reducing privacy cost and improving utility, especially when private data is limited.
Contribution
We introduce EPSVec, a lightweight, privacy-preserving technique that uses dataset vectors to steer LLMs for synthetic data generation, decoupling privacy from sample size and enhancing efficiency.
Findings
EPSVec outperforms existing methods in distributional alignment.
It achieves higher downstream utility in low-data regimes.
The approach significantly reduces computational overhead.
Abstract
High-quality data is essential for modern machine learning, yet many valuable corpora are sensitive and cannot be freely shared. Synthetic data offers a practical substitute for downstream development, and large language models (LLMs) have emerged as powerful engines for generating it. However, existing private text generation methods are severely inefficient: they are data-intensive, computationally slow, and often require large private corpora or batch sizes to achieve usable quality. We introduce EPSVec, a lightweight, differentially private alternative that steers LLM generation using *dataset vectors*: directions in activation space that capture the distributional gap between private data and public priors. EPSVec extracts and sanitizes steering vectors just once and then performs standard decoding. This decouples the privacy budget from generation, enabling arbitrarily many…
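The extract-once, sanitize-once, then decode idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the excerpt does not say how dataset vectors are extracted or sanitized, so the sketch assumes a difference of mean activations, per-example L2 clipping, and the classic Gaussian mechanism. All function names and parameters (`private_dataset_vector`, `steer`, `max_norm`, `strength`) are hypothetical, and activations are plain Python lists rather than real model hidden states.

```python
import math
import random


def clip(vec, max_norm):
    """Scale vec down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in vec]


def mean(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise scale for the classic Gaussian mechanism (valid for 0 < epsilon < 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def private_dataset_vector(private_acts, public_acts, max_norm, epsilon, delta, rng):
    """Assumed recipe: noisy (clipped private mean - public mean)."""
    n = len(private_acts)
    private_mean = mean([clip(a, max_norm) for a in private_acts])
    public_mean = mean(public_acts)  # public data carries no privacy cost
    # Replacing one private record moves the clipped mean by at most 2 * max_norm / n.
    sigma = gaussian_sigma(2.0 * max_norm / n, epsilon, delta)
    return [p - q + rng.gauss(0.0, sigma) for p, q in zip(private_mean, public_mean)]


def steer(hidden, vector, strength=1.0):
    """Add the sanitized vector to a hidden state during ordinary decoding."""
    return [h + strength * v for h, v in zip(hidden, vector)]


# Toy usage: 4-dimensional "activations" drawn at random for illustration only.
rng = random.Random(0)
private_acts = [[rng.gauss(1.0, 0.1) for _ in range(4)] for _ in range(200)]
public_acts = [[rng.gauss(0.0, 0.1) for _ in range(4)] for _ in range(200)]
vector = private_dataset_vector(private_acts, public_acts, 3.0, 0.5, 1e-5, rng)

# The privacy budget was spent once above; every steer() call is post-processing.
steered_states = [steer([0.0] * 4, vector) for _ in range(1000)]
```

Because differential privacy is preserved under post-processing, reusing the same sanitized vector for any number of generations adds no further privacy cost in this sketch, which is the sense in which generation is decoupled from the budget.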
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Topic Modeling · Mobile Crowdsensing and Crowdsourcing
