An Empirical Study on Data Leakage and Generalizability of Link Prediction Models for Issues and Commits
Maliheh Izadi, Pooya Rostami Mazrae, Tom Mens, Arie van Deursen

TL;DR
This study evaluates the impact of data leakage and temporal data splitting on link prediction models for software artifacts, introducing LinkFormer, which improves accuracy and generalizability using Transformer-based fine-tuning.
Contribution
The paper presents LinkFormer, a Transformer-based approach that enhances link prediction accuracy and generalizability by considering temporal data splits and transfer learning.
Findings
LinkFormer achieves a 48% higher F1-score than existing approaches in the project-based setting.
Temporal data splitting better simulates real-world scenarios.
Cross-project performance of LinkFormer is comparable to within-project results.
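To illustrate why a temporal split is less prone to data leakage than a random one, here is a minimal sketch. It is not the authors' implementation: the record layout and the names `temporal_split`, `random_split`, and `leaks_future` are hypothetical and chosen only for this example.

```python
import random
from datetime import date

# Hypothetical issue-commit link records: (link_id, creation date).
records = [(f"link-{m}", date(2021, m, 1)) for m in range(1, 11)]


def temporal_split(items, train_fraction=0.8):
    """Train on the oldest records and test on the newest ones."""
    ordered = sorted(items, key=lambda r: r[1])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def random_split(items, train_fraction=0.8, seed=0):
    """Shuffle before splitting, ignoring creation dates."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def leaks_future(train, test):
    """True if any training record is newer than the oldest test record."""
    return max(r[1] for r in train) > min(r[1] for r in test)


train_t, test_t = temporal_split(records)
train_r, test_r = random_split(records)

# A temporal split never trains on data newer than the test set.
print("temporal split leaks future data:", leaks_future(train_t, test_t))
# A random split usually does, because dates are interleaved.
print("random split leaks future data:", leaks_future(train_r, test_r))
```

With a temporal split, the model is evaluated only on links created after everything it was trained on, which matches how a deployed link predictor would be used.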
Abstract
To enhance documentation and maintenance practices, developers conventionally establish links between related software artifacts manually. Empirical research has revealed that developers frequently overlook this practice, resulting in significant information loss. To address this issue, automatic link recovery techniques have been proposed. However, these approaches primarily focused on improving prediction accuracy on randomly-split datasets, with limited attention given to the impact of data leakage and the generalizability of the predictive models. LinkFormer seeks to address these limitations. Our approach not only preserves and improves the accuracy of existing predictions but also enhances their alignment with real-world settings and their generalizability. First, to better utilize contextual information for prediction, we employ the Transformer architecture and fine-tune multiple…
Taxonomy
Topics: Software Engineering Research · Software System Performance and Reliability · Software Reliability and Analysis Research
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Adam · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Absolute Position Encodings · Layer Normalization
