Ground-Truth Subgraphs for Better Training and Evaluation of Knowledge Graph Augmented LLMs

Alberto Cattaneo; Carlo Luschi; Daniel Justus

arXiv:2511.04473 · cs.LG · December 5, 2025

TL;DR

This paper introduces SynthKGQA, a framework for generating high-quality QA datasets from knowledge graphs, enabling better training and evaluation of KG-augmented language models, and presents GTSQA for testing zero-shot generalization.

Contribution

The paper presents SynthKGQA for creating detailed QA datasets from any knowledge graph, and introduces GTSQA to evaluate zero-shot generalization of KG retrievers.

Findings

01

SynthKGQA enables more informative benchmarking of KG retrievers.

02

Models trained with SynthKGQA data perform better on knowledge graph tasks.

03

GTSQA tests zero-shot generalization of KG-augmented LLMs.

Abstract

Retrieval of information from graph-structured knowledge bases represents a promising direction for improving the factuality of LLMs. While various solutions have been proposed, a comparison of methods is difficult due to the lack of challenging QA datasets with ground-truth targets for graph retrieval. We present SynthKGQA, an LLM-powered framework for generating high-quality Knowledge Graph Question Answering datasets from any Knowledge Graph, providing the full set of ground-truth facts in the KG to reason over questions. We show how, in addition to enabling more informative benchmarking of KG retrievers, the data produced with SynthKGQA also allows us to train better models. We apply SynthKGQA to Wikidata to generate GTSQA, a new dataset designed to test zero-shot generalization abilities of KG retrievers with respect to unseen graph structures and relation types, and benchmark…
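
To make concrete why ground-truth subgraphs allow more informative benchmarking, the following is a minimal sketch of how a retriever's output could be scored against the ground-truth facts for one question. It is an illustration only, not the paper's evaluation code: the function name `triple_retrieval_scores`, the choice of set-based precision, recall, and F1 over (subject, relation, object) triples, and the toy Wikidata-style identifiers are all assumptions made for this example.

```python
from typing import Dict, Iterable, Tuple

# A KG fact as (subject, relation, object). The IDs used below are toy,
# Wikidata-style placeholders, not facts taken from GTSQA.
Triple = Tuple[str, str, str]


def triple_retrieval_scores(
    retrieved: Iterable[Triple], ground_truth: Iterable[Triple]
) -> Dict[str, float]:
    """Set-based precision, recall, and F1 of retrieved triples
    against a ground-truth subgraph (hypothetical helper)."""
    retrieved_set = set(retrieved)
    truth_set = set(ground_truth)
    hits = len(retrieved_set & truth_set)

    # Guard against empty sets so the scores are always defined.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(truth_set) if truth_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Ground-truth subgraph for a hypothetical 2-hop question.
ground_truth = [("Q1", "P19", "Q2"), ("Q2", "P17", "Q3")]

# A retriever that finds both needed facts plus two distractors.
retrieved = [
    ("Q1", "P19", "Q2"),
    ("Q1", "P27", "Q3"),
    ("Q4", "P31", "Q5"),
    ("Q2", "P17", "Q3"),
]

scores = triple_retrieval_scores(retrieved, ground_truth)
print(scores)
```

In this toy case the retriever recovers every required fact (recall 1.0) but half of what it returns is noise (precision 0.5). Answer-only evaluation cannot make that distinction, because it only sees whether the final answer is correct.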

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. **Addresses a Core Problem with a Verifiable Solution:** The paper accurately identifies the central bottleneck in KG-RAG evaluation and training: the lack of ground-truth subgraphs. The proposed `SynthKGQA` framework provides a sophisticated and powerful solution. By using an "LLM-propose + SPARQL-validate" loop, it programmatically guarantees the factual consistency of the generated (question, SPARQL, answer, subgraph) tuples. Its 0.47% validation failure rate is far lower than alternative

Weaknesses

1. **Severe "Closed-Loop Evaluation" and Synthetic-to-Real Generalization Gap** This paper's most critical limitation is its "closed-loop" evaluation. While Section 6 effectively demonstrates a "synthetic-to-synthetic" gain (training on GTSQA improves performance on GTSQA), the paper completely lacks the most crucial experiment: demonstrating "synthetic-to-real" generalization. It fails to show if a model trained on GTSQA outperforms an SP-trained model on a real-world, human-created benchmark

Reviewer 02 · Rating 2 · Confidence 3

Strengths

The paper highlights that by using ground-truth subgraphs it is possible to train better KG retrievers (Table 3). The description of the generation framework is clear and well-summarized by Figure 1.

Weaknesses

The authors briefly mention concurrent works in the main text, relegating a more detailed comparison to Appendix E. These are the most relevant works to this paper and should be discussed in more detail within the main text. The two concurrent works mentioned in the paper use ground-truth subgraphs to generate question-answers; hence, it seems that the differences between the proposed approach and the most recent concurrent work primarily lie in aspects like the usage of all seed entities, SPARQ

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. They proposed SynthKGQA, a new framework that enables scalable creation of KGQA datasets using LLMs.
2. Using this framework, they introduced GTSQA, a new KGQA dataset.

Weaknesses

The framework proposed in this paper, called SynthKGQA, which utilizes an LLM, is a method that has already been used in other KGQA studies [1, 2]. If the framework claimed as the main contribution of this paper has already been introduced in previous works, it is difficult to regard this paper’s contribution as significant. The paper should provide a detailed explanation of how this approach differs from those existing methods. [1] Ronak Pradeep, Daniel Lee, Ali Mousavi, Jeffrey Pound, Yisi Sa

Code & Models

Datasets

Graphcore/GTSQA
dataset · 144 dl

Taxonomy

Topics: Advanced Graph Neural Networks · Topic Modeling · Graph Theory and Algorithms