LeakDojo: Decoding the Leakage Threats of RAG Systems

Maosen Zhang, Jianshuo Dong, Boting Lu, Wenyue Li, Xiaoping Zhang, Tianwei Zhang, and Han Qiu

arXiv:2605.05818 · cs.CR · May 8, 2026

TL;DR

LeakDojo is a framework for systematically evaluating leakage risks in Retrieval-Augmented Generation systems, revealing how query generation, instructions, and model capabilities influence data leakage.

Contribution

The paper introduces LeakDojo, a configurable framework for benchmarking RAG leakage under controlled conditions, and identifies factors that affect leakage risk in LLM-based retrieval systems.

Findings

01

Query generation and adversarial instructions contribute independently to leakage, and overall leakage is well approximated by their product.

02

Stronger instruction-following capability correlates with higher leakage risk.

03

Improving RAG faithfulness can lead to increased leakage.

Abstract

Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to leverage external knowledge, but also exposes valuable RAG databases to leakage attacks. As RAG systems grow more complex and LLMs exhibit stronger instruction-following capabilities, existing studies fall short of systematically assessing RAG leakage risks. We present LeakDojo, a configurable framework for controlled evaluation of RAG leakage. Using LeakDojo, we benchmark six existing attacks across fourteen LLMs, four datasets, and diverse RAG systems. Our study reveals that (1) query generation and adversarial instructions contribute independently to leakage, with overall leakage well approximated by their product; (2) stronger instruction-following capability correlates with higher leakage risk; and (3) improvements in RAG faithfulness can introduce increased leakage risk. These findings provide actionable…
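The product approximation in finding (1) can be illustrated with a small sketch. This is not LeakDojo's API or the paper's metric definitions: the function names, the two rate parameters, and all numbers below are hypothetical, chosen only to show how two independently measured factors might be combined and compared against a measured end-to-end value.

```python
def approx_leakage(retrieval_coverage: float, extraction_rate: float) -> float:
    """Predict overall leakage as the product of two independent factors.

    retrieval_coverage: hypothetical fraction of the database that the
        attacker's generated queries cause the retriever to surface.
    extraction_rate: hypothetical fraction of retrieved content that the
        adversarial instruction gets the LLM to reproduce.
    """
    for name, value in (("retrieval_coverage", retrieval_coverage),
                        ("extraction_rate", extraction_rate)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return retrieval_coverage * extraction_rate


def relative_error(measured: float, predicted: float) -> float:
    """Relative gap between a measured leakage rate and the prediction."""
    if measured == 0.0:
        raise ValueError("measured leakage must be non-zero")
    return abs(measured - predicted) / measured


if __name__ == "__main__":
    # Made-up numbers for illustration only, not results from the paper.
    predicted = approx_leakage(retrieval_coverage=0.6, extraction_rate=0.5)
    measured = 0.32  # hypothetical end-to-end measurement
    print(f"predicted={predicted:.2f}, "
          f"relative error={relative_error(measured, predicted):.1%}")
```

Under this reading, improving either factor alone raises leakage proportionally, which is why the two can be studied separately.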

Code & Models

Repositories

yeasen-z/LeakDojo (GitHub)
