GRADE: Generating multi-hop QA and fine-gRAined Difficulty matrix for RAG Evaluation

Jeongsoo Lee; Daeyong Kwon; Kyohoon Jin

arXiv:2508.16994·cs.CL·December 16, 2025

TL;DR

GRADE is an evaluation framework for RAG systems that models task difficulty along two dimensions, reasoning depth and semantic distance, enabling fine-grained analysis of multi-hop reasoning performance.

Contribution

The paper proposes a 2D difficulty matrix for RAG evaluation that combines the number of reasoning steps (hops) with the semantic distance between a query and its supporting evidence, and it builds a synthetic dataset for controlled multi-hop QA testing.

Findings

01. Error rates correlate with difficulty measures

02. Framework enables fine-grained RAG performance analysis

03. Scalable approach for evaluating multi-hop reasoning

Abstract

Retrieval-Augmented Generation (RAG) systems are widely adopted in knowledge-intensive NLP tasks, but current evaluations often overlook the structural complexity and multi-step reasoning required in real-world scenarios. In particular, existing benchmarks neglect key factors such as the interaction between retrieval difficulty and reasoning depth. To address this gap, we propose GRADE, a novel evaluation framework that models task difficulty along two orthogonal dimensions: (1) reasoning depth, defined by the number of inference steps (hops), and (2) semantic distance between the query and its supporting evidence. We construct a synthetic multi-hop QA dataset from factual news articles by extracting knowledge graphs and augmenting them through semantic clustering to recover missing links, allowing us to generate diverse and difficulty-controlled queries. Central to our framework is a 2D difficulty…
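
The two dimensions named in the abstract, hop count and query-evidence semantic distance, can be illustrated with a minimal sketch that tabulates error rates per cell of a 2D grid. This is not the authors' implementation: the `QAResult` record, the `distance_bin` and `error_matrix` helpers, the assumption that distances lie in [0, 1], and the equal-width binning are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QAResult:
    """One evaluated question (hypothetical record layout)."""
    hops: int        # reasoning depth: number of inference steps
    distance: float  # query-evidence semantic distance, assumed to lie in [0, 1]
    correct: bool    # whether the RAG system answered correctly


def distance_bin(distance: float, n_bins: int) -> int:
    """Map a distance in [0, 1] to an equal-width bin index in [0, n_bins - 1]."""
    clamped = min(max(distance, 0.0), 1.0)
    return min(int(clamped * n_bins), n_bins - 1)


def error_matrix(results, n_bins=4):
    """Return {(hops, distance_bin): error_rate} for every non-empty cell."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for r in results:
        cell = (r.hops, distance_bin(r.distance, n_bins))
        totals[cell] += 1
        if not r.correct:
            errors[cell] += 1
    return {cell: errors[cell] / totals[cell] for cell in totals}


# Toy data, invented purely to exercise the functions.
results = [
    QAResult(hops=2, distance=0.10, correct=True),
    QAResult(hops=2, distance=0.15, correct=False),
    QAResult(hops=4, distance=0.90, correct=False),
]
matrix = error_matrix(results, n_bins=4)
for (hops, dbin), rate in sorted(matrix.items()):
    print(f"hops={hops} distance_bin={dbin} error_rate={rate:.2f}")
```

Keeping the matrix as a sparse dictionary keyed by cell avoids reporting rates for empty cells; a real evaluation would likely want many items per cell before comparing rates across the grid.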
