An Empirical Study on Data Leakage and Generalizability of Link Prediction Models for Issues and Commits

Maliheh Izadi; Pooya Rostami Mazrae; Tom Mens; Arie van Deursen

arXiv:2211.00381 · cs.SE · April 25, 2023

TL;DR

This study evaluates the impact of data leakage and temporal data splitting on link prediction models for software artifacts, introducing LinkFormer, which improves accuracy and generalizability using Transformer-based fine-tuning.

Contribution

The paper presents LinkFormer, a Transformer-based approach that enhances link prediction accuracy and generalizability by considering temporal data splits and transfer learning.

Findings

1. LinkFormer achieves a 48% higher F1-score in project-based settings.

2. Temporal data splitting better simulates real-world scenarios.

3. Cross-project performance of LinkFormer is comparable to within-project results.
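
The temporal-splitting finding above can be illustrated with a small sketch. This is not the authors' pipeline: the `LinkRecord` type, its `created_at` field, the 80/20 ratio, and the helper names are all hypothetical, chosen only to show how a time-ordered split differs from a random one. In a random split, training records may be newer than test records, which is one way information from the "future" can leak into training.

```python
from dataclasses import dataclass
from datetime import datetime
import random


@dataclass(frozen=True)
class LinkRecord:
    """Hypothetical issue-commit pair; field names are assumptions."""
    issue_id: str
    commit_id: str
    created_at: datetime  # assumed timestamp used for ordering
    is_link: bool


def temporal_split(records, train_fraction=0.8):
    """Train on the oldest records, test on the newest ones."""
    ordered = sorted(records, key=lambda r: r.created_at)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def random_split(records, train_fraction=0.8, seed=0):
    """Shuffle before cutting, ignoring time order."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def leaks_future(train, test):
    """True if any training record is newer than the oldest test record."""
    if not train or not test:
        return False
    newest_train = max(r.created_at for r in train)
    oldest_test = min(r.created_at for r in test)
    return newest_train > oldest_test
```

By construction, `leaks_future` returns `False` for any output of `temporal_split`, whereas a `random_split` of the same records will usually mix older and newer pairs across the two sets.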

Abstract

To enhance documentation and maintenance practices, developers conventionally establish links between related software artifacts manually. Empirical research has revealed that developers frequently overlook this practice, resulting in significant information loss. To address this issue, automatic link recovery techniques have been proposed. However, these approaches have primarily focused on improving prediction accuracy on randomly split datasets, with limited attention given to the impact of data leakage and the generalizability of the predictive models. LinkFormer seeks to address these limitations. Our approach not only preserves and improves the accuracy of existing predictions but also enhances their alignment with real-world settings and their generalizability. First, to better utilize contextual information for prediction, we employ the Transformer architecture and fine-tune multiple…

Taxonomy

Topics: Software Engineering Research · Software System Performance and Reliability · Software Reliability and Analysis Research

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Adam · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Absolute Position Encodings · Layer Normalization