When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation
Zhishang Xiang, Chuanjie Wu, Qinggang Zhang, Shengyuan Chen, Zijin Hong, Xiao Huang, and Jinsong Su

TL;DR
This paper introduces GraphRAG-Bench, a comprehensive benchmark to evaluate when graph retrieval-augmented generation (GraphRAG) outperforms traditional RAG, focusing on hierarchical knowledge retrieval and reasoning tasks.
Contribution
It provides a systematic evaluation framework and guidelines for effective application of GraphRAG in various scenarios.
Findings
GraphRAG outperforms RAG in complex reasoning tasks.
GraphRAG's benefits are most evident in hierarchical knowledge retrieval.
The benchmark covers diverse tasks from fact retrieval to creative generation.
Abstract
Graph retrieval-augmented generation (GraphRAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) with external knowledge. It leverages graphs to model the hierarchical structure between specific concepts, enabling more coherent and effective knowledge retrieval for accurate reasoning. Despite its conceptual promise, recent studies report that GraphRAG frequently underperforms vanilla RAG on many real-world tasks. This raises a critical question: Is GraphRAG really effective, and in which scenarios do graph structures provide measurable benefits for RAG systems? To address this, we propose GraphRAG-Bench, a comprehensive benchmark designed to evaluate GraphRAG models on both hierarchical knowledge retrieval and deep contextual reasoning. GraphRAG-Bench features a comprehensive dataset with tasks of increasing difficulty, covering fact retrieval, complex…
Peer Reviews
Decision: ICLR 2026 Poster
1. This paper studies a fundamental and important problem. GraphRAG is trending, so the work is well motivated. The proposed benchmark provides room to systematically examine the advantages of different GraphRAG systems. 2. This manuscript identifies key limitations of existing RAG benchmarking datasets, including neglecting the evaluation of logical reasoning, limited corpus coverage, and focusing only on end results. 3. Apart from general QA, the proposed benchmark includes a set of tasks,
1. The dependence of GraphRAG on LLM ability. The authors tested two models (GPT-4o-mini and Qwen2.5-14B) on the benchmark. However, an analysis of how GraphRAG depends on the ability (size) of the LLM is missing. It would be helpful if the authors could analyze the minimum model size needed for GraphRAG to succeed. 2. Some of the multi-hop QA datasets, which contain QA pairs with various question types and different difficulties, such as CWQ, MuSiQue, and 2WikiMultihopQA, are not discussed in th
This work tackles the key open question of when graph-based retrieval truly benefits RAG systems, bridging a major gap in empirical understanding. Through dense, well-controlled experiments across domains and tasks, it provides strong, data-driven evidence supporting its conclusions and design insights.
1. The paper claims that difficulty in the proposed four-level task hierarchy increases with retrieval difficulty and reasoning complexity. However, the paper does not provide formal or operational definitions for these levels: there are no explicit thresholds for evidence quantity, reasoning steps, or context length that determine the boundaries between levels. 2. The benchmark does not disclose the absolute corpus size for the Novel and Medical datasets. Consequently, it remains unclear how Gr
- This benchmark focuses on tasks with different difficulty levels, reflecting real-world scenarios demanding complex logical synthesis. - This benchmark provides a comprehensive evaluation for GraphRAG with clear quantitative metrics, including graph structure, retrieval performance, efficiency, and final output quality. - Based on thorough experiments across multiple GraphRAG methods, LLMs, and tasks, this paper offers clear and practical guidelines on when GraphRAG outperforms traditional RAG
- The calculation process for some evaluation metrics is simplistic and unclear. For example, the paper does not explain how it determines whether a claim $c$ is supported by the context $C$ when calculating EVIDENCE RECALL and FAITHFULNESS. - The process of Logic and Evidence Extraction is not clearly described either in the main text or the appendix. While the use of GPT-4.1 is mentioned, the specific extraction procedure and output format are not illustrated clearly.
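To make the open question in the review above concrete, here is a minimal sketch of how claim-level metrics such as EVIDENCE RECALL and FAITHFULNESS are commonly computed. It is not the paper's procedure. The definitions are assumptions: evidence recall is taken as the fraction of reference claims supported by the retrieved context, and faithfulness as the fraction of answer claims supported by that context. The function names and the token-overlap judge `token_overlap_supported` are hypothetical stand-ins for whatever support judgment (for example, an LLM or NLI model) the benchmark actually uses.

```python
from typing import Callable, Sequence

# Hypothetical judge type: decides whether claim c is supported by context C.
SupportJudge = Callable[[str, str], bool]


def token_overlap_supported(claim: str, context: str, threshold: float = 0.8) -> bool:
    """Illustrative stand-in only: a claim counts as supported when at least
    `threshold` of its distinct lowercase tokens also occur in the context."""
    claim_tokens = set(claim.lower().split())
    if not claim_tokens:
        return False
    context_tokens = set(context.lower().split())
    return len(claim_tokens & context_tokens) / len(claim_tokens) >= threshold


def supported_fraction(
    claims: Sequence[str],
    context: str,
    is_supported: SupportJudge = token_overlap_supported,
) -> float:
    """Fraction of claims judged as supported by the context (0.0 if no claims)."""
    if not claims:
        return 0.0
    return sum(is_supported(c, context) for c in claims) / len(claims)


def evidence_recall(reference_claims, retrieved_context, is_supported=token_overlap_supported):
    # Assumed definition: share of gold-evidence claims covered by the retrieved context.
    return supported_fraction(reference_claims, retrieved_context, is_supported)


def faithfulness(answer_claims, retrieved_context, is_supported=token_overlap_supported):
    # Assumed definition: share of generated-answer claims grounded in the retrieved context.
    return supported_fraction(answer_claims, retrieved_context, is_supported)
```

The two metrics share one computation and differ only in which claims are checked, so the choice of `is_supported` determines both scores. That is why the reviewer asks for it to be specified.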
Taxonomy
Topics: Advanced Graph Neural Networks · Topic Modeling · Multimodal Machine Learning Applications
Methods: Linear Layer · Byte Pair Encoding · Attention Is All You Need · WordPiece · Weight Decay · Multi-Head Attention · Attention Dropout · Dropout · Dense Connections
