Improving Data Curation of Software Vulnerability Patches through Uncertainty Quantification
Hui Chen, Yunhua Zhao, Kostadin Damevski

TL;DR
This paper introduces a novel approach using Uncertainty Quantification to improve the quality of software vulnerability patch datasets, enhancing downstream security tasks by filtering out low-utility patches.
Contribution
It evaluates various UQ techniques and proposes a heuristic to select high-utility vulnerability patches, improving dataset quality and model efficiency.
Findings
Model Ensemble and heteroscedastic models perform best for UQ in vulnerability datasets.
UQ-based filtering improves predictive performance of vulnerability models.
UQ-based filtering also significantly reduces training time and energy consumption.
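The findings above can be illustrated with a minimal sketch of ensemble-based uncertainty filtering. This is not the paper's selection heuristic, which this summary does not describe. It assumes each ensemble member outputs a probability that a patch is a vulnerability fix, uses the entropy of the averaged probability as the uncertainty score, and keeps patches below a threshold. The function names, the sample probabilities, and the `max_entropy` value are all hypothetical.

```python
import math


def predictive_entropy(member_probs):
    """Binary entropy (in nats) of the ensemble-averaged positive-class probability."""
    p = sum(member_probs) / len(member_probs)
    # Clamp to avoid log(0) when every member predicts exactly 0 or 1.
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


def filter_by_uncertainty(patches, ensemble_probs, max_entropy):
    """Keep patches whose predictive entropy is at or below max_entropy."""
    kept = []
    for patch, probs in zip(patches, ensemble_probs):
        if predictive_entropy(probs) <= max_entropy:
            kept.append(patch)
    return kept


# Hypothetical data: three patches scored by a three-member ensemble.
patches = ["patch_a", "patch_b", "patch_c"]
ensemble_probs = [
    [0.97, 0.95, 0.99],  # averaged probability far from 0.5: low uncertainty
    [0.55, 0.40, 0.52],  # averaged probability near 0.5: high uncertainty
    [0.05, 0.02, 0.08],  # averaged probability far from 0.5: low uncertainty
]

# The 0.3 threshold is an arbitrary illustrative value, not one from the paper.
selected = filter_by_uncertainty(patches, ensemble_probs, max_entropy=0.3)
print(selected)
```

This sketch filters on uncertainty alone, so it also keeps confidently negative examples. A curation pipeline aimed at high-utility vulnerability patches would likely combine the uncertainty score with the predicted label or other criteria.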
Abstract
The changesets (or patches) that fix open source software vulnerabilities form critical datasets for various machine learning security-enhancing applications, such as automated vulnerability patching and silent fix detection. These patch datasets are derived from extensive collections of historical vulnerability fixes, maintained in databases like the Common Vulnerabilities and Exposures list and the National Vulnerability Database. However, since these databases focus on rapid notification to the security community, they contain significant inaccuracies and omissions that have a negative impact on downstream software security quality assurance tasks. In this paper, we propose an approach employing Uncertainty Quantification (UQ) to curate datasets of publicly-available software vulnerability patches. Our methodology leverages machine learning models that incorporate UQ to…
Taxonomy
Topics: Software Reliability and Analysis Research · Software Engineering Research · Scientific Computing and Data Management
