Mitigating Data Imbalance for Software Vulnerability Assessment: Does Data Augmentation Help?
Triet H. M. Le, M. Ali Babar

TL;DR
This study investigates the impact of data imbalance on software vulnerability assessment and demonstrates that data augmentation techniques significantly improve model performance across multiple CVSS prediction tasks.
Contribution
The paper presents the first large-scale analysis of data imbalance in software vulnerability (SV) assessment and shows that data augmentation can effectively mitigate this issue, improving predictive performance.
Findings
Data imbalance significantly affects SV assessment performance.
Models trained with simple text augmentation outperform the baseline models.
Mitigating data imbalance improves model performance by up to 31.8% in Matthews Correlation Coefficient (MCC).
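To illustrate why MCC is a common metric when classes are imbalanced, here is a minimal sketch of its binary form. The function name `mcc_binary` and the toy confusion-matrix counts are assumptions made for illustration. The CVSS tasks in the paper are multi-class, so this simplified version is not the paper's exact evaluation code.

```python
import math


def mcc_binary(tp, tn, fp, fn):
    """Matthews Correlation Coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: MCC is taken as 0 when a row or column is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical imbalanced test set: 95 negatives, 5 positives.
# A model that always predicts the majority class looks accurate...
accuracy = (0 + 95) / 100
# ...but MCC reveals that it has no predictive power.
always_majority = mcc_binary(tp=0, tn=95, fp=0, fn=5)

print(f"accuracy={accuracy:.2f}, mcc={always_majority:.2f}")
```

In this toy case the accuracy is 0.95 while the MCC is 0, which is why a metric like MCC is more informative than accuracy on skewed class distributions.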
Abstract
Background: Software Vulnerability (SV) assessment is increasingly adopted to address the ever-increasing volume and complexity of SVs. Data-driven approaches have been widely used to automate SV assessment tasks, particularly the prediction of the Common Vulnerability Scoring System (CVSS) metrics such as exploitability, impact, and severity. SV assessment suffers from the imbalanced distributions of the CVSS classes, but such data imbalance has been hardly understood and addressed in the literature. Aims: We conduct a large-scale study to quantify the impacts of data imbalance and mitigate the issue for SV assessment through the use of data augmentation. Method: We leverage nine data augmentation techniques to balance the class distributions of the CVSS metrics. We then compare the performance of SV assessment models with and without leveraging the augmented data. Results: Through…
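The Method step in the abstract, balancing class distributions through augmentation, can be sketched roughly as follows. This is a hedged illustration only: the abstract does not name the nine techniques, so the random word deletion and swap used here are stand-ins, and the helper names (`augment`, `balance`), the `p_delete` parameter, and the toy descriptions are all hypothetical.

```python
import random
from collections import Counter


def augment(text, rng, p_delete=0.1):
    """Return a perturbed copy of text: random word deletion, then one random swap."""
    words = text.split()
    # Keep at least one word so the augmented sample is never empty.
    kept = [w for w in words if rng.random() > p_delete] or words[:1]
    if len(kept) > 1:
        i, j = rng.sample(range(len(kept)), 2)
        kept[i], kept[j] = kept[j], kept[i]
    return " ".join(kept)


def balance(texts, labels, seed=0):
    """Oversample each minority class with augmented copies until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)
    for label, count in counts.items():
        pool = [t for t, l in zip(texts, labels) if l == label]
        for _ in range(target - count):
            out_texts.append(augment(rng.choice(pool), rng))
            out_labels.append(label)
    return out_texts, out_labels


# Toy SV descriptions with an imbalanced (hypothetical) severity label.
texts = [
    "buffer overflow in parser allows remote code execution",
    "sql injection in login form allows data disclosure",
    "use after free in renderer allows arbitrary code execution",
    "verbose error message reveals internal file path",
]
labels = ["HIGH", "HIGH", "HIGH", "LOW"]

balanced_texts, balanced_labels = balance(texts, labels)
print(Counter(balanced_labels))
```

In a setup like this, augmentation would normally be applied to the training split only, so that evaluation still reflects the original class distribution.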
Taxonomy
Topics: Software Reliability and Analysis Research · Web Application Security Vulnerabilities · Software System Performance and Reliability
