Towards Better Benchmark Datasets for Inductive Knowledge Graph Completion

Harry Shomer; Jay Revolinsky; Jiliang Tang

arXiv:2406.11898·cs.AI·June 26, 2025

TL;DR

This paper identifies flaws in existing inductive knowledge graph completion datasets caused by shortcuts like PPR scores, proposes improved dataset construction methods, and benchmarks methods to better evaluate true inductive reasoning capabilities.

Contribution

The authors introduce a new dataset construction strategy for inductive KGC that mitigates shortcuts, enabling more accurate assessment of model capabilities.

Findings

01. Current datasets contain shortcuts that inflate performance.

02. New datasets reduce shortcut exploitation and provide clearer evaluation.

03. Benchmark results highlight genuine model strengths and weaknesses.

Abstract

Knowledge Graph Completion (KGC) attempts to predict missing facts in a Knowledge Graph (KG). Recently, there's been an increased focus on designing KGC methods that can excel in the inductive setting, where a portion or all of the entities and relations seen in inference are unobserved during training. Numerous benchmark datasets have been proposed for inductive KGC, all of which are subsets of existing KGs used for transductive KGC. However, we find that the current procedure for constructing inductive KGC datasets inadvertently creates a shortcut that can be exploited even while disregarding the relational information. Specifically, we observe that the Personalized PageRank (PPR) score can achieve strong or near SOTA performance on most datasets. In this paper, we study the root cause of this problem. Using these insights, we propose an alternative strategy for constructing inductive…
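The shortcut described in the abstract can be illustrated with a small sketch: score candidate tail entities by their Personalized PageRank from the query head, while ignoring relation labels entirely. This is only an illustration under stated assumptions, not the authors' implementation. The function names, the toy triples, the restart probability `alpha=0.15`, and the choice to treat the graph as undirected are all hypothetical.

```python
from collections import defaultdict


def build_adjacency(triples):
    """Build undirected neighbor sets from (head, relation, tail) triples.

    The relation label is deliberately discarded, mirroring the idea that
    the shortcut works without relational information.
    """
    adj = defaultdict(set)
    for head, _relation, tail in triples:
        adj[head].add(tail)
        adj[tail].add(head)
    return dict(adj)


def personalized_pagerank(adj, source, alpha=0.15, iters=100):
    """Approximate PPR by power iteration, restarting at `source`.

    alpha is the restart probability (an assumed value, not from the paper).
    """
    if source not in adj:
        raise KeyError(f"{source!r} is not in the graph")
    scores = {node: 0.0 for node in adj}
    scores[source] = 1.0
    for _ in range(iters):
        nxt = {node: 0.0 for node in adj}
        for node, neighbors in adj.items():
            # Spread the non-restart mass evenly over the neighbors.
            share = (1.0 - alpha) * scores[node] / len(neighbors)
            for neighbor in neighbors:
                nxt[neighbor] += share
        # Return the restart mass to the query entity.
        nxt[source] += alpha
        scores = nxt
    return scores


def rank_candidates(triples, head, candidates):
    """Rank candidate tails for a query head by PPR score alone."""
    scores = personalized_pagerank(build_adjacency(triples), head)
    return sorted(candidates, key=lambda c: scores.get(c, 0.0), reverse=True)


# Hypothetical toy graph: a chain a - b - c - d with arbitrary relation names.
toy_triples = [("a", "r1", "b"), ("b", "r2", "c"), ("c", "r1", "d")]
ranking = rank_candidates(toy_triples, "a", ["d", "b"])
```

In this toy chain the nearby entity `b` is ranked ahead of the distant entity `d` for the query head `a`, even though no relation type was consulted. A benchmark whose true tails tend to sit closer to the head than the sampled negatives would reward exactly this kind of scoring.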

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01·Rating 8·Confidence 4

Strengths

1. **Addressing a Fundamental Problem in Existing Inductive Datasets**: The paper identifies that the shortest path distance (SPD) shortcut is prevalent among several existing inductive KG datasets. Furthermore, the authors demonstrate that all datasets generated by varying configurations of the existing data generation algorithm suffer from this problem. This indicates that the current inductive data generation strategy is fundamentally flawed, and addressing these flaws is crucial to genuinely

Weaknesses

**My main concern with this paper is that it misses key citations in Related Work**. Specifically, Gao et al. [1] was a concurrent work to Galkin et al. [2] that first provided a theoretical understanding of what is necessary for solving the (E, R) inductive KGC task. Gao et al. also introduced two new (E, R)-capable methods, ISDEA+ and DEq-InGram, with the latter being an improved version of the original InGram [3]. Additionally, two new (E, R) datasets, PediaTypes and WikiTopics, were proposed

Reviewer 02·Rating 8·Confidence 3

Strengths

– Numerous machine learning and AI approaches have been introduced in the literature. To demonstrate the effectiveness of new models compared to older ones, several standard datasets have been established for evaluation and conclusion. However, issues with these datasets can lead to misleading conclusions within the community about model performance. Therefore, the paper's original motivation and the problem it addresses are both significant and valuable.
– Overall, the paper is written well a

Weaknesses

– Some parts of the paper require a better and more detailed explanation, e.g., the description of PageRank, which is important for understanding the issue.
– It is understandable that the deltaSPD of train and inference should be close, but it is not clear why this partitioning solves the main issue raised.
– The analysis of the tables should be more detailed and should interpret the various observations.
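To make the deltaSPD quantity in the comments above concrete, here is a minimal sketch that compares the mean shortest path distance (SPD) of positive query pairs with that of negative pairs. The exact definition used in the paper is not given in this summary, so the sketch assumes deltaSPD is the mean SPD of negatives minus the mean SPD of positives. All function names and the toy graph are hypothetical.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Build undirected neighbor sets from (u, v) edges of the observed graph."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def shortest_path_distance(adj, source, target):
    """Breadth-first search distance in hops; None if target is unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adj.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def mean_spd(adj, pairs):
    """Mean SPD over reachable pairs (unreachable pairs are skipped here)."""
    distances = [shortest_path_distance(adj, u, v) for u, v in pairs]
    reachable = [d for d in distances if d is not None]
    if not reachable:
        raise ValueError("no reachable pairs to average over")
    return sum(reachable) / len(reachable)


def spd_gap(adj, positives, negatives):
    """Assumed reading of deltaSPD: mean SPD of negatives minus positives."""
    return mean_spd(adj, negatives) - mean_spd(adj, positives)


# Hypothetical observed graph: a chain a - b - c - d - e.
toy_adj = build_adjacency([("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")])
gap = spd_gap(toy_adj, positives=[("a", "c")], negatives=[("a", "e")])
```

Under this reading, a large positive gap means true tails are systematically closer to the head than sampled negatives, which a distance-based score can exploit without using relations. Skipping unreachable pairs is a simplification; another convention could assign them a fixed large distance instead.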

Reviewer 03·Rating 6·Confidence 4

Strengths

1. This work provides a clearer understanding of the capabilities and challenges in inductive KGC and the benchmark construction.
2. The manuscript is well-organized and easy to follow.
3. Experiments have verified the effectiveness of the proposed benchmarks.

Weaknesses

1. The technical contribution is somewhat limited for a long paper. The applied graph partitioning strategy is straightforward and the research challenge in this paper is not significant.
2. There is a confusing point in the benchmark construction. Why do the differences in SPD benefit from graph partitioning sampling? Since “the different in mean SPD” (also a typo in line 237) causes this shortcut, why not improve the negative sampling strategy directly? For example, select entities having

Code & Models

Repositories

HarryShomer/Better-Inductive-KGC
PyTorch·Official
