RePCS: Diagnosing Data Memorization in LLM-Powered Retrieval-Augmented Generation

Le Vu Anh; Nguyen Viet Anh; Mehmet Dik; Luong Van Nghia

arXiv:2506.15513·cs.LG·June 19, 2025

TL;DR

RePCS is a lightweight, model-agnostic diagnostic tool that detects whether a retrieval-augmented generation model is relying on memorized training data rather than the retrieved evidence, helping make LLM outputs safer and more reliable.

Contribution

RePCS introduces a novel, gradient-free method to diagnose data memorization in RAG systems by comparing inference paths using KL divergence, with theoretical guarantees and practical efficiency.

Findings

01. RePCS achieves 0.918 ROC-AUC on Prompt-WNQA.

02. Outperforms prior methods by 6.5 percentage points.

03. Requires less than 5% additional latency.

Abstract

Retrieval-augmented generation (RAG) has become a common strategy for updating large language model (LLM) responses with current, external information. However, models may still rely on memorized training data, bypass the retrieved evidence, and produce contaminated outputs. We introduce Retrieval-Path Contamination Scoring (RePCS), a diagnostic method that detects such behavior without requiring model access or retraining. RePCS compares two inference paths, (i) a parametric path using only the query and (ii) a retrieval-augmented path using both the query and the retrieved context, by computing the Kullback-Leibler (KL) divergence between their output distributions. A low divergence suggests that the retrieved context had minimal impact, indicating potential memorization. This procedure is model-agnostic, requires no gradient or internal state access, and adds only a single additional…
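The two-path comparison described in the abstract can be illustrated with a minimal sketch. It assumes the output distributions of both paths are already available as probability vectors over the same support. The function names, the direction of the divergence (retrieval-augmented path relative to parametric path), the `eps` floor, and the `threshold` value are all assumptions made for illustration and are not taken from the paper.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Return KL(p || q) for two discrete distributions on the same support.

    Terms with p_i == 0 contribute nothing; q_i is floored at eps
    (an assumed smoothing constant) to avoid division by zero.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:
            total += p_i * math.log(p_i / max(q_i, eps))
    return total


def flag_possible_memorization(p_rag, p_param, threshold):
    """Score one query and flag it when the two paths barely differ.

    p_rag:     output distribution of the retrieval-augmented path
    p_param:   output distribution of the parametric (query-only) path
    threshold: hypothetical cutoff; it would need to be calibrated on data
    """
    score = kl_divergence(p_rag, p_param)
    # Low divergence means the retrieved context changed little.
    return score, score < threshold


if __name__ == "__main__":
    # Toy distributions, invented purely for illustration.
    unchanged = flag_possible_memorization([0.7, 0.2, 0.1], [0.7, 0.2, 0.1], 0.1)
    shifted = flag_possible_memorization([0.1, 0.1, 0.8], [0.7, 0.2, 0.1], 0.1)
    print("unchanged by retrieval:", unchanged)
    print("shifted by retrieval:", shifted)
```

In this sketch a query is flagged only when the score falls below the cutoff, mirroring the statement that low divergence indicates potential memorization. How the distributions are obtained and how the cutoff is chosen are left open here.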

Taxonomy

Topics: Advanced Data Storage Technologies