Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking

Liangliang Zhang; Zhuorui Jiang; Hongliang Chi; Haoyang Chen; Mohammed Elkoumy; Fali Wang; Qiong Wu; Zhengyi Zhou; Shirui Pan; Suhang Wang; Yao Ma

arXiv:2505.23495 · cs.CL · November 5, 2025

TL;DR

This paper identifies critical quality issues in popular KGQA datasets and introduces KGQAGen, an LLM-based framework that generates a more reliable and challenging benchmark to better evaluate KGQA systems.

Contribution

The paper presents KGQAGen, a novel LLM-in-the-loop framework for improving dataset quality and constructing a large-scale, reliable KGQA benchmark grounded in Wikidata.

Findings

1. A manual audit of 16 popular KGQA datasets, including WebQSP and CWQ, finds an average factual correctness rate of only 57%.

2. State-of-the-art models perform poorly on the new KGQAGen-10k benchmark.

3. KGQAGen exposes limitations of current KGQA systems.

Abstract

Knowledge Graph Question Answering (KGQA) systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. However, despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues, including inaccurate or incomplete ground-truth annotations, poorly constructed questions that are ambiguous, trivial, or unanswerable, and outdated or inconsistent knowledge. Through a manual audit of 16 popular KGQA datasets, including WebQSP and CWQ, we find that the average factual correctness rate is only 57%. To address these issues, we introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls. KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable QA instances. Using KGQAGen, we construct KGQAGen-10k, a ten-thousand-scale benchmark…
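To make the idea of symbolic verification concrete, the sketch below checks a multi-hop QA instance against a small knowledge graph by following its relation path and comparing the result with the claimed answers. It is an illustration only, not the paper's implementation: the toy triples, the relation names, and the helpers `follow_path` and `verify_instance` are all hypothetical, and the actual framework is grounded in Wikidata rather than in an in-memory set.

```python
from typing import Iterable

Triple = tuple[str, str, str]  # (subject, relation, object)

# Toy knowledge graph for illustration only (not data from the paper).
TOY_KG: set[Triple] = {
    ("Inception", "director", "Christopher Nolan"),
    ("Interstellar", "director", "Christopher Nolan"),
    ("Christopher Nolan", "place_of_birth", "London"),
}


def follow_path(kg: set[Triple], start: str, relations: Iterable[str]) -> set[str]:
    """Follow a chain of relations from `start` and return the entities reached."""
    frontier = {start}
    for rel in relations:
        # One hop: keep the objects of matching triples whose subject is in the frontier.
        frontier = {o for (s, r, o) in kg if s in frontier and r == rel}
    return frontier


def verify_instance(
    kg: set[Triple], start: str, relations: Iterable[str], claimed: Iterable[str]
) -> bool:
    """Accept a QA instance only if the path yields exactly the claimed answers."""
    return follow_path(kg, start, relations) == set(claimed)


# Hypothetical question: "Where was the director of Inception born?"
is_valid = verify_instance(TOY_KG, "Inception", ["director", "place_of_birth"], {"London"})
```

Requiring an exact match on the answer set, rather than a subset, is one way to catch both inaccurate and incomplete ground-truth annotations, the two annotation problems named in the abstract.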
