Data Quality for Software Vulnerability Datasets

Roland Croft; M. Ali Babar; Mehdi Kholoosi

arXiv:2301.05456·cs.SE·January 16, 2023

Open Access

TL;DR

This paper investigates the quality of software vulnerability datasets, revealing significant inaccuracies and duplications that can adversely affect vulnerability prediction models and emphasizing the need for improved data quality assessment.

Contribution

The paper provides a comprehensive analysis of data quality issues in vulnerability datasets and highlights their impact on model performance, an effect that prior research has underexplored.

Findings

1. 20-71% of vulnerability labels are inaccurate
2. 17-99% of data points are duplicated
3. Data quality issues significantly affect model training and benchmarking
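
To make the duplication and label-consistency findings concrete, the following is a minimal Python sketch of how such rates could be measured on a list of code samples. It is not the authors' procedure: the helper names (normalize, duplicate_rate, conflicting_label_groups) are hypothetical, and treating two samples as duplicates when they match after whitespace normalization is an assumption made here for illustration. The paper's own definition of a duplicate may differ.

```python
import hashlib
from collections import Counter


def normalize(code: str) -> str:
    """Collapse all whitespace so formatting-only differences do not hide duplicates.

    Assumption: whitespace-insensitive exact match is the duplicate criterion.
    """
    return " ".join(code.split())


def duplicate_rate(samples: list[str]) -> float:
    """Fraction of samples that repeat another sample after normalization."""
    if not samples:
        return 0.0
    counts = Counter(
        hashlib.sha256(normalize(s).encode("utf-8")).hexdigest() for s in samples
    )
    # Every copy beyond the first in a group counts as redundant.
    redundant = sum(n - 1 for n in counts.values())
    return redundant / len(samples)


def conflicting_label_groups(samples: list[str], labels: list[int]) -> int:
    """Number of duplicate groups whose members carry different labels."""
    seen: dict[str, set[int]] = {}
    for sample, label in zip(samples, labels):
        seen.setdefault(normalize(sample), set()).add(label)
    return sum(1 for group in seen.values() if len(group) > 1)


# Toy data (invented for illustration, not taken from any dataset in the paper).
samples = [
    "int f() { return 0; }",
    "int f()  {\n return 0;\n}",  # same code, different formatting
    "int g() { return 1; }",
]
labels = [1, 0, 0]  # the two copies of f() disagree on the label

print(duplicate_rate(samples))
print(conflicting_label_groups(samples, labels))
```

Duplicates that appear in both the training and test splits can inflate benchmark scores, and duplicate groups with conflicting labels give a model contradictory supervision, which is one way such issues can affect training and benchmarking.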

Abstract

The use of learning-based techniques to achieve automated software vulnerability detection has been of longstanding interest within the software security domain. These data-driven solutions are enabled by large software vulnerability datasets used for training and benchmarking. However, we observe that the quality of the data powering these solutions is currently ill-considered, hindering the reliability and value of produced outcomes. Whilst awareness of software vulnerability data preparation challenges is growing, there has been little investigation into the potential negative impacts of software vulnerability data quality. For instance, we lack confirmation that vulnerability labels are correct or consistent. Our study seeks to address such shortcomings by inspecting five inherent data quality attributes for four state-of-the-art software vulnerability datasets and the subsequent…

Taxonomy

Topics: Software Reliability and Analysis Research · Software Engineering Research · Web Application Security Vulnerabilities