Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking
Liangliang Zhang, Zhuorui Jiang, Hongliang Chi, Haoyang Chen, Mohammed Elkoumy, Fali Wang, Qiong Wu, Zhengyi Zhou, Shirui Pan, Suhang Wang, Yao Ma

TL;DR
This paper identifies critical quality issues in popular KGQA datasets and introduces KGQAGen, an LLM-based framework that generates a more reliable and challenging benchmark to better evaluate KGQA systems.
Contribution
The paper presents KGQAGen, a novel LLM-in-the-loop framework for improving dataset quality and constructing a large-scale, reliable KGQA benchmark grounded in Wikidata.
Findings
A manual audit of 16 popular KGQA datasets, including WebQSP and CWQ, finds an average factual correctness rate of only 57%.
State-of-the-art models perform poorly on the new KGQAGen-10k benchmark.
KGQAGen exposes limitations of current KGQA systems.
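To make the audit figure concrete, here is a minimal sketch of how an average factual correctness rate could be computed from manual audit labels. It assumes a macro average (each dataset weighted equally); the summary does not say whether the 57% figure is a per-dataset or pooled average. The dataset names, labels, and function names below are hypothetical and are not taken from the paper.

```python
from statistics import mean


def correctness_rate(labels):
    """Fraction of audited samples judged factually correct."""
    if not labels:
        raise ValueError("no audited samples")
    return sum(labels) / len(labels)


def macro_average(audits):
    """Mean of per-dataset correctness rates (each dataset weighted equally)."""
    return mean(correctness_rate(labels) for labels in audits.values())


# Hypothetical audit labels: True means the ground-truth answer was judged correct.
# These values are made up for illustration and are not results from the paper.
audits = {
    "dataset_a": [True, True, False, True],
    "dataset_b": [True, False, False, False],
}

rates = {name: correctness_rate(labels) for name, labels in audits.items()}
avg = macro_average(audits)
print(rates, avg)
```

A pooled (micro) average would instead divide the total number of correct samples by the total number of audited samples, which gives a different figure when datasets contribute unequal sample counts.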
Abstract
Knowledge Graph Question Answering (KGQA) systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. However, despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues, including inaccurate or incomplete ground-truth annotations, poorly constructed questions that are ambiguous, trivial, or unanswerable, and outdated or inconsistent knowledge. Through a manual audit of 16 popular KGQA datasets, including WebQSP and CWQ, we find that the average factual correctness rate is only 57%. To address these issues, we introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls. KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable QA instances. Using KGQAGen, we construct KGQAGen-10k, a ten-thousand-scale benchmark…
Taxonomy
Topics: Machine Learning in Healthcare · Radiomics and Machine Learning in Medical Imaging
Methods: Sparse Evolutionary Training
