Automatic Data Labeling for Software Vulnerability Prediction Models: How Far Are We?
Triet H. M. Le, M. Ali Babar

TL;DR
This paper evaluates the quality and effectiveness of auto-labeled software vulnerability data, revealing significant noise but also substantial improvements in prediction performance when used appropriately.
Contribution
It provides a comprehensive analysis of auto-labeled SV data quality and demonstrates how noise reduction can improve SV prediction models.
Findings
Over 50% of auto-labeled SVs are noisy (incorrectly labeled), and they hardly overlap with publicly reported SVs.
SV prediction models trained with auto-labeled data outperform the original models by up to 22% in Matthews Correlation Coefficient (MCC) and up to 90% in recall.
Applying noise-reduction methods can improve the utility of auto-labeled SV data.
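For readers unfamiliar with the two metrics named in the findings, the following is a minimal sketch of how MCC and recall are computed from binary predictions. It is not the authors' evaluation code. The helper names (confusion_counts, recall, mcc, relative_gain) are hypothetical, and the relative_gain helper assumes the reported percentages are relative improvements, which this summary does not state.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = vulnerable, 0 = not)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def recall(tp, fn):
    """Fraction of truly vulnerable samples that the model flags."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; ranges from -1 to 1."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def relative_gain(new, old):
    """Relative change of a metric (assumption: gains are reported this way)."""
    return (new - old) / abs(old)


# Toy labels for illustration only; not data from the paper.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print("recall:", recall(tp, fn))
print("mcc:", mcc(tp, tn, fp, fn))
```

MCC uses all four cells of the confusion matrix, so it stays informative when vulnerable samples are rare, whereas recall only reflects how many of the vulnerable samples were caught.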
Abstract
Background: Software Vulnerability (SV) prediction needs large-sized and high-quality data to perform well. Current SV datasets mostly require expensive labeling efforts by experts (human-labeled) and thus are limited in size. Meanwhile, there are growing efforts in automatic SV labeling at scale. However, the fitness of auto-labeled data for SV prediction is still largely unknown. Aims: We quantitatively and qualitatively study the quality and use of the state-of-the-art auto-labeled SV data, D2A, for SV prediction. Method: Using multiple sources and manual validation, we curate clean SV data from human-labeled SV-fixing commits in two well-known projects for investigating the auto-labeled counterparts. Results: We discover that 50+% of the auto-labeled SVs are noisy (incorrectly labeled), and they hardly overlap with the publicly reported ones. Yet, SV prediction models utilizing the…
Taxonomy
Topics: Software Reliability and Analysis Research · Software Engineering Research · Software System Performance and Reliability
