The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG)
Shenglai Zeng, Jiankun Zhang, Pengfei He, Yue Xing, Yiding Liu, Han, Xu, Jie Ren, Shuaiqiang Wang, Dawei Yin, Yi Chang, Jiliang Tang

TL;DR
This paper investigates privacy risks in retrieval-augmented generation (RAG) systems, revealing vulnerabilities to private data leakage and proposing insights for enhancing privacy protections in RAG-enabled language models.
Contribution
The study introduces novel attack methods to demonstrate RAG's privacy vulnerabilities and provides new insights into balancing data privacy in RAG systems.
Findings
RAG systems are vulnerable to private data leakage through novel attack methods.
RAG can reduce the leakage of training data from large language models.
Empirical evidence shows both privacy risks and mitigation strategies in RAG systems.
Abstract
Retrieval-augmented generation (RAG) is a powerful technique to facilitate language model with proprietary and private data, where data privacy is a pivotal concern. Whereas extensive research has demonstrated the privacy risks of large language models (LLMs), the RAG technique could potentially reshape the inherent behaviors of LLM generation, posing new privacy issues that are currently under-explored. In this work, we conduct extensive empirical studies with novel attack methods, which demonstrate the vulnerability of RAG systems on leaking the private retrieval database. Despite the new risk brought by RAG on the retrieval data, we further reveal that RAG can mitigate the leakage of the LLMs' training data. Overall, we provide new insights in this paper for privacy protection of retrieval-augmented LLMs, which benefit both LLMs and RAG systems builders. Our code is available at…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsPrivacy, Security, and Data Protection · Privacy-Preserving Technologies in Data · Technology Adoption and User Behaviour
MethodsRefunds@Expedia|||How do I get a full refund from Expedia? · WordPiece · Residual Connection · Linear Layer · Byte Pair Encoding · Weight Decay · Dropout · Multi-Head Attention · Linear Warmup With Linear Decay · Attention Dropout
