Data Quality for Software Vulnerability Datasets
Roland Croft, M. Ali Babar, Mehdi Kholoosi

TL;DR
This paper investigates the quality of software vulnerability datasets, revealing significant inaccuracies and duplications that can adversely affect vulnerability prediction models and emphasizing the need for improved data quality assessment.
Contribution
The paper provides a comprehensive analysis of data quality issues in vulnerability datasets and highlights their impact on model performance, an impact that prior research has underexplored.
Findings
20-71% of vulnerability labels are inaccurate
17-99% of data points are duplicated
Data quality issues of this kind can adversely affect the training and benchmarking of vulnerability prediction models
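
As a rough illustration of how a duplication figure like the one above could be measured, the sketch below computes the share of redundant samples in a list of code snippets. It is not the authors' method: this summary does not say how the paper detects duplicates. The whitespace-only normalization and the names `normalize` and `duplicate_rate` are assumptions made for this example.

```python
import hashlib
import re
from collections import Counter


def normalize(code: str) -> str:
    """Collapse whitespace runs so formatting-only differences are ignored.

    Assumption: whitespace-insensitive matching; a real study may use a
    stricter or looser notion of "duplicate".
    """
    return re.sub(r"\s+", " ", code).strip()


def duplicate_rate(samples: list[str]) -> float:
    """Return the fraction of samples that repeat another sample."""
    if not samples:
        return 0.0
    # Hash each normalized sample so equal snippets share one key.
    counts = Counter(
        hashlib.sha256(normalize(s).encode("utf-8")).hexdigest()
        for s in samples
    )
    # Every copy beyond the first in a group counts as redundant.
    redundant = sum(n - 1 for n in counts.values())
    return redundant / len(samples)


if __name__ == "__main__":
    # Hypothetical toy data, not taken from any dataset in the paper.
    toy = [
        "int f() { return 0; }",
        "int  f()  {\n return 0;\n}",
        "int g() { return 1; }",
        "int g() { return 1; }",
    ]
    print(f"duplicate rate: {duplicate_rate(toy):.2f}")
```

Hashing normalized text only catches exact and formatting-level copies. Near-duplicates, such as snippets with renamed identifiers, would need token-level or similarity-based comparison.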
Abstract
The use of learning-based techniques to achieve automated software vulnerability detection has been of longstanding interest within the software security domain. These data-driven solutions are enabled by large software vulnerability datasets used for training and benchmarking. However, we observe that the quality of the data powering these solutions is currently ill-considered, hindering the reliability and value of produced outcomes. Whilst awareness of software vulnerability data preparation challenges is growing, there has been little investigation into the potential negative impacts of software vulnerability data quality. For instance, we lack confirmation that vulnerability labels are correct or consistent. Our study seeks to address such shortcomings by inspecting five inherent data quality attributes for four state-of-the-art software vulnerability datasets and the subsequent…
Taxonomy
Topics: Software Reliability and Analysis Research · Software Engineering Research · Web Application Security Vulnerabilities
