Impacts of Data Splitting Strategies on Parameterized Link Prediction Algorithms
Xinshan Jiao, Yuxin Luo, Yilin Bi, and Tao Zhou

TL;DR
This paper investigates how data splitting strategies affect the evaluation of parameterized link prediction algorithms, revealing significant overestimation due to information leakage and emphasizing the need for standardized protocols.
Contribution
It introduces the Loss Ratio metric to quantify performance overestimation and highlights the importance of proper data splitting for fair benchmarking.
Findings
Information leakage inflates reported performance by about 3.6% on average across 60 real-world networks, and by more than 15% for specific algorithms.
Heuristic and random-walk methods are more robust against data splitting issues.
Standardized data splitting strategies are essential for reproducible link prediction evaluation.
Abstract
Link prediction is a fundamental problem in network science, aiming to infer potential or missing links based on observed network structures. With the increasing adoption of parameterized models, the rigor of evaluation protocols has become critically important. However, a previously common practice of using the test set during hyperparameter tuning has led to human-induced information leakage, thereby inflating reported model performance. To address this issue, this study introduces a novel evaluation metric, Loss Ratio, which quantitatively measures the extent of performance overestimation. We conduct large-scale experiments on 60 real-world networks across six domains. The results demonstrate that information leakage leads to an average overestimation of about 3.6%, with the bias exceeding 15% for specific algorithms. Meanwhile, heuristic and random-walk-based methods…
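The leakage described in the abstract can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy random graph, the 80/10/10 split, the helper names (`build_adj`, `score`, `auc`, `run`), and the one-parameter similarity index, which sums degree(z)^(-beta) over common neighbours z (beta = 0 gives common neighbours, beta = 1 gives resource allocation). The sketch compares a protocol that tunes beta on a validation set with one that tunes it directly on the test set. It does not compute the paper's Loss Ratio, whose definition is not given on this page.

```python
import random
from itertools import combinations


def build_adj(nodes, edges):
    """Adjacency sets for an undirected graph."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def score(adj, u, v, beta):
    """Illustrative parameterized index: sum of degree(z)**(-beta) over
    common neighbours z. beta=0 -> common neighbours, beta=1 -> resource
    allocation. A common neighbour always has degree >= 2."""
    return sum(len(adj[z]) ** -beta for z in adj[u] & adj[v])


def auc(adj, positives, negatives, beta):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [score(adj, u, v, beta) for u, v in positives]
    neg = [score(adj, u, v, beta) for u, v in negatives]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def run(seed=0, n=60, p=0.15, betas=(0.0, 0.5, 1.0, 1.5, 2.0)):
    rng = random.Random(seed)
    nodes = list(range(n))
    pairs = list(combinations(nodes, 2))
    edges = [e for e in pairs if rng.random() < p]  # toy random graph
    edge_set = set(edges)
    non_edges = [e for e in pairs if e not in edge_set]

    # Assumed 80/10/10 split of the observed links.
    rng.shuffle(edges)
    k = len(edges) // 10
    test, val, train = edges[:k], edges[k:2 * k], edges[2 * k:]
    negatives = rng.sample(non_edges, 50)

    g_train = build_adj(nodes, train)
    g_trainval = build_adj(nodes, train + val)

    # Proper protocol: choose beta on the validation links only,
    # then evaluate once on the held-out test links.
    beta_val = max(betas, key=lambda b: auc(g_train, val, negatives, b))
    proper = auc(g_trainval, test, negatives, beta_val)

    # Leaky protocol: choose beta by looking at the test links themselves.
    leaky = max(auc(g_trainval, test, negatives, b) for b in betas)
    return proper, leaky


if __name__ == "__main__":
    proper, leaky = run()
    print(f"proper test AUC: {proper:.4f}")
    print(f"leaky  test AUC: {leaky:.4f}")
```

Because both protocols are scored on the same test links and the same observed graph, the leaky score can never be lower than the properly tuned one in this sketch; the gap between them is the kind of overestimation the paper sets out to quantify. On a structureless toy graph like this one the gap may be small or zero.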
