Source Attribution in Retrieval-Augmented Generation
Ikhtiyor Nematov, Tarik Kalai, Elizaveta Kuzmenko, Gabriele Fugagnoli, Dimitris Sacharidis, Katja Hose, Tomer Sagi

TL;DR
This paper adapts Shapley value-based attribution methods to Retrieval-Augmented Generation (RAG) systems, aiming to identify influential retrieved documents while limiting the number of expensive LLM calls that attribution requires.
Contribution
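The paper systematically applies Shapley-based attribution to RAG at the document level, compares exact Shapley values with cheaper approximations and existing LLM attribution methods, and evaluates how well they identify influential documents in practical, complex scenarios.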
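To make the document-level setting concrete, here is a minimal sketch of exact Shapley attribution over a set of retrieved documents. It is an illustration, not the authors' implementation: `exact_shapley` and `toy_utility` are hypothetical names, and the toy utility stands in for what would be an LLM call followed by answer scoring in a real RAG pipeline.

```python
from functools import lru_cache
from itertools import combinations
from math import factorial


@lru_cache(maxsize=None)
def toy_utility(subset):
    """Hypothetical stand-in for an LLM call plus answer scoring.

    Assumption: the answer is correct (utility 1.0) if document "A" is in
    the context, or if both "B" and "C" are present together.
    """
    return 1.0 if "A" in subset or {"B", "C"} <= subset else 0.0


def exact_shapley(docs, utility):
    """Exact Shapley value of each document under `utility`.

    `utility` takes a frozenset of documents and returns a float.
    """
    n = len(docs)
    values = {d: 0.0 for d in docs}
    for d in docs:
        others = [x for x in docs if x != d]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude d.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                values[d] += weight * (utility(s | {d}) - utility(s))
    return values


scores = exact_shapley(["A", "B", "C"], toy_utility)
print(scores)  # A ≈ 0.667, B ≈ 0.167, C ≈ 0.167
```

With caching, the sketch still evaluates the utility on every one of the 2^n document subsets (8 here), which is why the exact computation becomes impractical when each evaluation is an LLM call.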
Findings
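Shapley approximations can closely mirror exact Shapley attributions.
SHAP approximations significantly reduce computational cost relative to exact Shapley computation, where every utility evaluation is an LLM call.
The methods can identify critical documents even when the documents have complex relationships with one another.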
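One generic way to trade accuracy for fewer utility evaluations is permutation sampling, sketched below. This is a standard Monte Carlo estimator and is not necessarily one of the approximations the paper evaluates; `sampled_shapley`, `toy_utility`, and the parameter `num_permutations` are hypothetical names chosen for illustration.

```python
import random


def toy_utility(subset):
    """Hypothetical stand-in for an LLM call plus answer scoring.

    Assumption: utility is 1.0 if document "A" is present, or if both
    "B" and "C" are present together; otherwise 0.0.
    """
    return 1.0 if "A" in subset or {"B", "C"} <= subset else 0.0


def sampled_shapley(docs, utility, num_permutations=200, seed=0):
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    totals = {d: 0.0 for d in docs}
    for _ in range(num_permutations):
        order = list(docs)
        rng.shuffle(order)
        prefix = frozenset()
        prev = utility(prefix)
        for d in order:
            # Marginal contribution of d when added after the documents before it.
            prefix = prefix | {d}
            cur = utility(prefix)
            totals[d] += cur - prev
            prev = cur
    return {d: total / num_permutations for d, total in totals.items()}


estimates = sampled_shapley(["A", "B", "C"], toy_utility)
print(estimates)
```

Each sampled ordering costs n + 1 utility evaluations, so the total budget is set by `num_permutations` rather than growing as 2^n. The estimates sum to the utility of the full document set minus that of the empty set, as exact Shapley values do.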
Abstract
While attribution methods, such as Shapley values, are widely used to explain the importance of features or training data in traditional machine learning, their application to Large Language Models (LLMs), particularly within Retrieval-Augmented Generation (RAG) systems, is nascent and challenging. The primary obstacle is the substantial computational cost, where each utility function evaluation involves an expensive LLM call, resulting in direct monetary and time expenses. This paper investigates the feasibility and effectiveness of adapting Shapley-based attribution to identify influential retrieved documents in RAG. We compare Shapley with more computationally tractable approximations and some existing attribution methods for LLMs. Our work aims to: (1) systematically apply established attribution principles to the RAG document-level setting; (2) quantify how well SHAP approximations…
