TL;DR
CUE-R introduces an intervention-based framework to evaluate the operational utility of individual evidence items in retrieval-augmented generation, revealing their impact on answer correctness and faithfulness.
Contribution
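The paper presents CUE-R, a lightweight method for measuring the utility of individual evidence items in RAG. It applies targeted perturbations (REMOVE, REPLACE, DUPLICATE) to single items and measures the effect along three utility axes: correctness, proxy-based grounding faithfulness, and confidence error.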
Findings
Removing or replacing evidence harms correctness and faithfulness.
Duplicating evidence often does not change behavior significantly.
Multi-hop evidence items interact non-additively: perturbing several supports together affects performance more than perturbing any single support.
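One way to make the non-additivity finding concrete is to compare the performance drop from perturbing two supports jointly with the sum of the drops from perturbing each one alone. The sketch below is illustrative only: the function name `interaction` and the numbers are invented for this example and are not taken from the paper.

```python
def interaction(drop_a: float, drop_b: float, drop_both: float) -> float:
    """Hypothetical interaction term for two supporting evidence items.

    Each argument is a drop in some score (e.g. correctness) relative to the
    unperturbed run. A value of 0 means the effects are additive; a positive
    value means the joint perturbation hurts more than the single
    perturbations combined.
    """
    return drop_both - (drop_a + drop_b)


# Invented numbers, for illustration only (not results from the paper).
drop_a = 0.25     # score drop when only support A is removed
drop_b = 0.25     # score drop when only support B is removed
drop_both = 1.0   # score drop when A and B are removed together

gap = interaction(drop_a, drop_b, drop_both)
is_non_additive = gap != 0.0
```

With these made-up values the joint drop exceeds the sum of the single drops, which is the pattern the finding describes.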
Abstract
As language models shift from single-shot answer generation toward multi-step reasoning that retrieves and consumes evidence mid-inference, evaluating the role of individual retrieved items becomes more important. Existing RAG evaluation typically targets final-answer quality, citation faithfulness, or answer-level attribution, but none of these directly targets the intervention-based, per-evidence-item utility view we study here. We introduce CUE-R, a lightweight intervention-based framework for measuring per-evidence-item operational utility in single-shot RAG using shallow observable retrieval-use traces. CUE-R perturbs individual evidence items via REMOVE, REPLACE, and DUPLICATE operators, then measures changes along three utility axes (correctness, proxy-based grounding faithfulness, and confidence error) plus a trace-divergence signal. We also outline an operational evidence-role…
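The three perturbation operators named in the abstract can be sketched on a plain list of retrieved passages. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `distractor` argument, and the toy `score` function (a stand-in for running the RAG pipeline and scoring one utility axis) are all hypothetical.

```python
from typing import Callable, List, Sequence

Evidence = List[str]


def remove(evidence: Sequence[str], i: int) -> Evidence:
    """REMOVE: drop the i-th evidence item."""
    return [e for j, e in enumerate(evidence) if j != i]


def replace(evidence: Sequence[str], i: int, distractor: str) -> Evidence:
    """REPLACE: swap the i-th item for a distractor passage (assumed given)."""
    out = list(evidence)
    out[i] = distractor
    return out


def duplicate(evidence: Sequence[str], i: int) -> Evidence:
    """DUPLICATE: insert a second copy of the i-th item right after it."""
    out = list(evidence)
    out.insert(i + 1, evidence[i])
    return out


def utility_delta(score: Callable[[Evidence], float],
                  evidence: Sequence[str],
                  perturbed: Evidence) -> float:
    """Score change caused by a perturbation (positive = the item helped)."""
    return score(list(evidence)) - score(perturbed)


# Toy stand-in for "run the pipeline and score correctness": the answer
# counts as correct only if some passage mentions the gold string.
def score(evidence: Evidence) -> float:
    return 1.0 if any("Paris" in e for e in evidence) else 0.0


passages = ["France is in Europe.", "The capital of France is Paris."]
delta_remove = utility_delta(score, passages, remove(passages, 1))
delta_replace = utility_delta(score, passages, replace(passages, 1, "Unrelated text."))
delta_duplicate = utility_delta(score, passages, duplicate(passages, 1))
```

In this toy setup, removing or replacing the supporting passage lowers the score, while duplicating it leaves the score unchanged. A fuller version would compute one such delta per utility axis, plus the trace-divergence signal.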
