TL;DR
GRADE introduces a comprehensive evaluation framework for RAG systems, modeling task difficulty through reasoning depth and semantic distance, enabling detailed analysis of multi-hop reasoning performance.
Contribution
It proposes a novel 2D difficulty matrix for RAG evaluation, incorporating reasoning steps and semantic distance, and creates a synthetic dataset for controlled multi-hop QA testing.
Findings
Error rates correlate with the framework's difficulty measures (reasoning depth and semantic distance)
Framework enables fine-grained RAG performance analysis
Scalable approach for evaluating multi-hop reasoning
Abstract
Retrieval-Augmented Generation (RAG) systems are widely adopted in knowledge-intensive NLP tasks, but current evaluations often overlook the structural complexity and multi-step reasoning required in real-world scenarios. In particular, existing benchmarks miss key factors such as the interaction between retrieval difficulty and reasoning depth. To address this gap, we propose GRADE, a novel evaluation framework that models task difficulty along two orthogonal dimensions: (1) reasoning depth, defined by the number of inference steps (hops), and (2) semantic distance between the query and its supporting evidence. We construct a synthetic multi-hop QA dataset from factual news articles by extracting knowledge graphs and augmenting them through semantic clustering to recover missing links, allowing us to generate diverse and difficulty-controlled queries. Central to our framework is a 2D difficulty…
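To make the two-dimensional idea concrete, here is a minimal sketch of how per-cell error rates could be tabulated over hop count and binned semantic distance. It is an illustration only, not the paper's implementation: the function name `difficulty_matrix`, the record layout, the bin edges, and the toy records are all assumptions, and the excerpt above does not say how GRADE actually bins semantic distance.

```python
from collections import defaultdict

def difficulty_matrix(records, distance_edges=(0.25, 0.5, 0.75)):
    """Tabulate error rate per (hops, distance_bin) cell.

    records: iterable of (hops, semantic_distance, is_correct) tuples.
    distance_edges: hypothetical bin boundaries, not taken from the paper.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for hops, distance, is_correct in records:
        # Bin index = number of edges the distance meets or exceeds.
        distance_bin = sum(distance >= edge for edge in distance_edges)
        cell = (hops, distance_bin)
        totals[cell] += 1
        if not is_correct:
            errors[cell] += 1
    # Error rate for every cell that received at least one query.
    return {cell: errors[cell] / totals[cell] for cell in totals}

# Made-up toy records: (hops, semantic distance in [0, 1], answered correctly?)
records = [
    (2, 0.10, True),
    (2, 0.15, False),
    (3, 0.80, False),
    (3, 0.90, False),
    (4, 0.60, True),
]
matrix = difficulty_matrix(records)
for cell in sorted(matrix):
    print(cell, matrix[cell])
```

Keeping the two axes as separate cell coordinates, rather than collapsing them into one difficulty score, lets errors driven by reasoning depth be read apart from errors driven by retrieval distance.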