A Case Study on the Impact of Anonymization Along the RAG Pipeline
Andreea-Elena Bodea, Stephen Meisenbacher, and Florian Matthes

TL;DR
This paper empirically investigates how the placement of anonymization within the Retrieval-Augmented Generation (RAG) pipeline affects the privacy-utility trade-off, and highlights why that placement matters for privacy risk mitigation.
Contribution
It systematically analyzes the impact of applying anonymization at different points in the RAG pipeline, a question not previously addressed in the literature.
Findings
Anonymization at the dataset level preserves utility better.
Anonymization at the answer generation stage enhances privacy.
The placement of anonymization significantly influences the privacy-utility balance.
Abstract
Despite the considerable promise of Retrieval-Augmented Generation (RAG), many real-world use cases may create privacy concerns, where the purported utility of RAG-enabled insights comes at the risk of exposing private information to either the LLM or the end user requesting the response. As a potential mitigation, using anonymization techniques to remove personally identifiable information (PII) and other sensitive markers in the underlying data represents a practical and sensible course of action for RAG administrators. Despite a wealth of literature on the topic, no works consider the placement of anonymization along the RAG pipeline, i.e., asking the question, where should anonymization happen? In this case study, we systematically and empirically measure the impact of anonymization at two important points along the RAG pipeline: the dataset and generated answer. We show that…
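To make the two placement points named in the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes a toy regex-based redactor, a word-overlap retriever, and a stand-in "generator" that simply echoes the retrieved context. All function names, patterns, and sample documents are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical toy patterns; a real anonymizer would use NER or dedicated PII tooling.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def anonymize(text: str) -> str:
    """Replace every match of each toy PII pattern with a placeholder tag."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by naive word overlap with the query (stand-in retriever)."""
    q_words = set(query.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def generate(query: str, context: list[str]) -> str:
    """Stand-in for an LLM call: echoes the retrieved context verbatim."""
    return f"Answer to '{query}': " + " ".join(context)


def rag_dataset_level(query: str, docs: list[str]) -> str:
    """Placement 1: anonymize the dataset, so the generator never sees raw PII."""
    clean_docs = [anonymize(d) for d in docs]
    return generate(query, retrieve(query, clean_docs))


def rag_answer_level(query: str, docs: list[str]) -> str:
    """Placement 2: run RAG on raw data, then anonymize only the generated answer."""
    raw_answer = generate(query, retrieve(query, docs))
    return anonymize(raw_answer)


docs = [
    "Contact Alice at alice@example.com about the billing dispute.",
    "The cafeteria opens at nine.",
]
query = "Who handles the billing dispute?"

dataset_answer = rag_dataset_level(query, docs)
answer_level_answer = rag_answer_level(query, docs)
print(dataset_answer)
print(answer_level_answer)
```

In this toy setup both variants return the same redacted text, because the stand-in generator only copies its context. The structural difference is which component is exposed to raw data: with dataset-level anonymization the retriever and generator only ever see redacted documents, while with answer-level anonymization they see the originals and only the end user is shielded. The sketch does not reproduce the privacy or utility measurements reported in the paper.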
