Privacy Implications of Retrieval-Based Language Models

Yangsibo Huang; Samyak Gupta; Zexuan Zhong; Kai Li; Danqi Chen

arXiv:2305.14888 · cs.CL · May 25, 2023 · 1 citation

TL;DR

Retrieval-based language models improve interpretability and factuality, but they can leak private data from their datastores more readily than parametric models, so their design and mitigation strategies must balance utility against privacy.

Contribution

This is the first study of privacy risks in retrieval-based LMs, particularly $k$NN-LMs. It also explores mitigation techniques that aim to balance privacy and utility.

Findings

1. $k$NN-LMs are more prone to leaking private data than parametric models.
2. Simple sanitization can eliminate privacy risks when sensitive information is easily detectable.
3. Decoupling query and key encoders improves the privacy-utility trade-off.
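To make the sanitization finding concrete, here is a minimal sketch of scrubbing easily detectable sensitive spans from text before it enters a retrieval datastore. The regular expressions, the `<REDACTED>` placeholder, and the function names are hypothetical choices for illustration; this summary does not specify the detectors the authors used.

```python
import re

# Hypothetical detectors for illustration only; not taken from the paper.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def sanitize(text: str, placeholder: str = "<REDACTED>") -> str:
    """Replace email addresses and US-style phone numbers with a placeholder."""
    text = EMAIL_RE.sub(placeholder, text)
    return PHONE_RE.sub(placeholder, text)


def build_sanitized_datastore(documents):
    """Sanitize every document before it is indexed for retrieval."""
    return [sanitize(doc) for doc in documents]


if __name__ == "__main__":
    docs = ["Contact alice@example.com or 555-123-4567."]
    print(build_sanitized_datastore(docs))
```

A pattern-based filter like this only helps when the sensitive content has a recognizable surface form, which matches the condition stated in the finding.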

Abstract

Retrieval-based language models (LMs) have demonstrated improved interpretability, factuality, and adaptability compared to their parametric counterparts, by incorporating retrieved text from external datastores. While it is well known that parametric models are prone to leaking private data, it remains unclear how the addition of a retrieval datastore impacts model privacy. In this work, we present the first study of privacy risks in retrieval-based LMs, particularly $k$NN-LMs. Our goal is to explore the optimal design and training procedure in domains where privacy is of concern, aiming to strike a balance between utility and privacy. Crucially, we find that $k$NN-LMs are more susceptible to leaking private information from their private datastore than parametric models. We further explore mitigations of privacy risks. When privacy information is targeted and readily detected in the…
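For readers unfamiliar with $k$NN-LMs, the sketch below shows the usual formulation, which this summary does not spell out: the next-token distribution is an interpolation between the parametric LM's prediction and a distribution built from the nearest datastore entries. The toy vectors, the function names, and the parameter names `lam` and `temperature` are illustrative assumptions, not the authors' code.

```python
import math


def knn_distribution(query, datastore, k, temperature=1.0):
    """Build a next-token distribution from the k nearest datastore entries.

    datastore is a list of (key_vector, next_token) pairs; closer keys get
    more weight via a softmax over negative squared L2 distances.
    """
    scored = sorted(
        (
            (sum((q - x) ** 2 for q, x in zip(query, key)), token)
            for key, token in datastore
        ),
        key=lambda pair: pair[0],
    )[:k]
    weights = [math.exp(-dist / temperature) for dist, _ in scored]
    total = sum(weights)
    probs = {}
    for weight, (_, token) in zip(weights, scored):
        probs[token] = probs.get(token, 0.0) + weight / total
    return probs


def interpolate(p_lm, p_knn, lam):
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_lm."""
    vocab = set(p_lm) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_lm.get(t, 0.0) for t in vocab}


if __name__ == "__main__":
    store = [([0.0, 0.0], "a"), ([10.0, 10.0], "b")]
    p_knn = knn_distribution([0.0, 0.0], store, k=1)
    print(interpolate({"a": 0.5, "b": 0.5}, p_knn, lam=0.25))
```

Because tokens stored in the datastore feed directly into the output distribution, a private datastore gives the model a second route for surfacing private text, which is the risk the abstract describes.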

Code & Models

Repositories

princeton-sysml/knnlm_privacy (PyTorch, official)

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Topic Modeling