Connect the Dots: Knowledge Graph-Guided Crawler Attack on Retrieval-Augmented Generation Systems
Mengyu Yao, Ziqi Zhang, Ning Luo, Shaofei Li, Yifeng Cai, Xiangqun Chen, Yao Guo, Ding Li

TL;DR
This paper introduces RAGCrawler, an adaptive, knowledge graph-guided attack method that effectively steals knowledge from retrieval-augmented generation systems, revealing significant vulnerabilities and the need for better protection.
Contribution
It formulates RAG knowledge-base stealing as an adaptive stochastic coverage problem and develops RAGCrawler, a novel attack method that outperforms existing heuristics in efficiency and effectiveness.
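To make the adaptive stochastic coverage framing concrete, here is a toy sketch of a budgeted adaptive greedy loop. It is not RAGCrawler: it assumes the attacker already knows which documents each query might retrieve and with what probability, which a real black-box attacker cannot observe directly. The names `expected_marginal_gain`, `adaptive_greedy`, and the `queries` layout are hypothetical and chosen only for illustration.

```python
import random


def expected_marginal_gain(docs, hit_prob, covered):
    """Expected number of NEW documents a query reveals, given current coverage.

    Assumption: each document in `docs` is retrieved independently with
    probability `hit_prob`, so the expectation is hit_prob * |docs - covered|.
    """
    return hit_prob * len(docs - covered)


def adaptive_greedy(queries, budget, rng):
    """Issue up to `budget` queries, always picking the highest expected gain.

    queries: dict mapping a query name to (set of doc ids, hit probability).
    Returns the set of covered doc ids and the list of queries issued.
    """
    covered, history = set(), []
    for _ in range(budget):
        # Re-score every query conditioned on what has been observed so far.
        best = max(
            queries,
            key=lambda q: expected_marginal_gain(queries[q][0], queries[q][1], covered),
        )
        docs, hit_prob = queries[best]
        if expected_marginal_gain(docs, hit_prob, covered) == 0:
            break  # nothing new can be gained from any query
        # Stochastic outcome: each candidate document leaks with prob. hit_prob.
        revealed = {d for d in sorted(docs) if rng.random() < hit_prob}
        covered |= revealed
        history.append(best)
    return covered, history


if __name__ == "__main__":
    toy_queries = {
        "a": ({0, 1, 2}, 0.8),
        "b": ({2, 3, 5}, 0.6),
        "c": ({4}, 0.9),
    }
    covered, history = adaptive_greedy(toy_queries, budget=5, rng=random.Random(0))
    print("queries issued:", history)
    print("documents covered:", sorted(covered))
```

The loop re-scores every query after each observation, which is what makes the policy adaptive: a query whose documents have mostly been seen already drops in priority, while one that failed to reveal anything can be retried.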
Findings
Achieves 66.8% average coverage within 1,000 queries
Reduces queries needed to reach 70% coverage by 4.03x
Enables surrogate reconstruction with high answer similarity
Abstract
Stealing attacks pose a persistent threat to the intellectual property of deployed machine-learning systems. Retrieval-augmented generation (RAG) intensifies this risk by extending the attack surface beyond model weights to the knowledge base, which often contains IP-bearing assets such as proprietary runbooks, curated domain collections, or licensed documents. Recent work shows that multi-turn questioning can gradually steal corpus content from RAG systems, yet existing attacks are largely heuristic and often plateau early. We address this gap by formulating RAG knowledge-base stealing as an adaptive stochastic coverage problem (ASCP), where each query is a stochastic action and the goal is to maximize the conditional expected marginal gain (CMG) in corpus coverage under a query budget. Bridging ASCP to real-world black-box RAG knowledge-base stealing raises three challenges: CMG is…
Taxonomy
Topics: Cryptography and Data Security · Information Retrieval and Search Behavior · Topic Modeling
