Needle-in-RAG: Prompt-Conditioned Character-Level Traceback of Poisoned Spans in Retrieved Evidence
Huining Cui, Wei Liu

TL;DR
This paper introduces RAGCharacter, a two-pass forensic framework for character-level traceback of poisoned spans in retrieval-augmented generation, enabling finer-grained evidence auditing.
Contribution
It presents a novel prompt-conditioned, black-box character-level traceback method and an evaluation protocol for localizing poisoned evidence in RAG systems.
Findings
RAGCharacter outperforms baselines in localization accuracy.
It balances localization precision against over-attribution.
The method is effective across multiple datasets, attack types, and models.
Abstract
Retrieval-augmented generation (RAG) improves factual grounding by conditioning large language models on retrieved evidence, but it also opens a data-layer attack surface: poisoned corpus entries can steer outputs without changing model parameters. Existing defenses and traceback methods are largely passage-level, which is too coarse for modern attacks whose effective payload may be a short fabricated claim, trigger phrase, or hidden instruction embedded inside an otherwise benign chunk. We study black-box character-level poison traceback in RAG and present RAGCharacter, a two-pass forensic framework that localizes the responsible retrieved span for a concrete misgeneration event. Pass-0 runs standard RAG while logging a prompt-anchored execution trace. Pass-1 re-enters a triggered trace and performs event-conditioned traceback over prompt-used evidence via budgeted counterfactual…
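As a rough illustration of what budgeted counterfactual traceback at character granularity can look like, the Python sketch below narrows a character window by masking evidence and re-querying a black-box oracle. It is not RAGCharacter's algorithm, which the truncated abstract does not spell out. The names `trace_span`, `keep_only`, and `triggers` are hypothetical, and the sketch assumes a single contiguous payload and that the misgeneration recurs exactly when the whole payload remains visible.

```python
from typing import Callable, Tuple


def keep_only(evidence: str, lo: int, hi: int, mask_char: str = " ") -> str:
    """Mask every character outside [lo, hi), preserving character offsets."""
    return mask_char * lo + evidence[lo:hi] + mask_char * (len(evidence) - hi)


def trace_span(
    evidence: str,
    triggers: Callable[[str], bool],
    budget: int = 64,
) -> Tuple[int, int, int]:
    """Return (start, end, calls) for a window [start, end) that still triggers.

    `triggers(text)` is a hypothetical black-box oracle (an assumption of this
    sketch): it reruns the pipeline with `text` as evidence and reports whether
    the misgeneration event still occurs. `budget` caps the oracle calls.
    """
    n = len(evidence)
    calls = 0

    # Phase 1: find the largest start such that keeping [start, n) still triggers.
    # Invariant: keeping [lo, n) triggers; keeping [hi, n) does not.
    lo, hi = 0, n
    while hi - lo > 1 and calls < budget:
        mid = (lo + hi) // 2
        calls += 1
        if triggers(keep_only(evidence, mid, n)):
            lo = mid
        else:
            hi = mid
    start = lo

    # Phase 2: find the smallest end such that keeping [start, end) still triggers.
    # Invariant: keeping [start, right) triggers; keeping [start, left) does not.
    left, right = start, n
    while right - left > 1 and calls < budget:
        mid = (left + right) // 2
        calls += 1
        if triggers(keep_only(evidence, start, mid)):
            right = mid
        else:
            left = mid

    return start, right, calls
```

Under these assumptions each phase is a binary search, so the number of oracle calls grows logarithmically with evidence length. If the budget runs out early, the returned window is wider than the payload but still contains it, so the sketch over-attributes instead of missing the span.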
