An Empirical Study of the Imbalance Issue in Software Vulnerability Detection
Yuejun Guo, Qiang Hu, Qiang Tang, Yves Le Traon

TL;DR
This study empirically investigates the impact of data imbalance on deep learning-based software vulnerability detection across multiple datasets, revealing varied effectiveness of existing solutions and highlighting the need for tailored approaches.
Contribution
It provides the first comprehensive empirical analysis of imbalance issues in vulnerability detection, evaluating existing solutions and offering insights for future method development.
Findings
Focal loss improves precision in vulnerability detection.
Class-balanced loss enhances recall.
Over-sampling increases F1-score.
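The three techniques named in these findings can be sketched in a few lines. The code below is a minimal illustration based on the commonly published formulations of focal loss, class-balanced weighting, and random over-sampling; it is not the study's implementation. The function names and the default values for `gamma`, `alpha`, and `beta` are assumptions chosen for illustration, not settings reported in this paper.

```python
import math
import random


def focal_loss(p_vulnerable, label, gamma=2.0, alpha=0.25):
    """Binary focal loss for one sample.

    p_vulnerable is the predicted probability of the vulnerable class (label 1).
    gamma and alpha are illustrative defaults, not values from the study.
    """
    p_t = p_vulnerable if label == 1 else 1.0 - p_vulnerable
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    # The (1 - p_t)^gamma factor down-weights samples the model already gets right.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def class_balanced_weights(class_counts, beta=0.999):
    """Per-class loss weights (1 - beta) / (1 - beta^n), n = samples in the class.

    Rarer classes receive larger weights. beta is an assumed hyperparameter.
    """
    return {c: (1.0 - beta) / (1.0 - beta ** n) for c, n in class_counts.items()}


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until all classes
    are as large as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for y, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for s in items + extra:
            out_samples.append(s)
            out_labels.append(y)
    return out_samples, out_labels
```

Focal loss and class-balanced weighting change how much each sample contributes to the training loss, while over-sampling changes the training data itself; over-sampling should be applied to the training split only, so that evaluation still reflects the original class ratio.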
Abstract
Vulnerability detection is crucial to protect software security. Nowadays, deep learning (DL) is the most promising technique to automate this detection task, leveraging its superior ability to extract patterns and representations within extensive code volumes. Despite its promise, DL-based vulnerability detection remains in its early stages, with model performance exhibiting variability across datasets. Drawing insights from other well-explored application areas like computer vision, we conjecture that the imbalance issue (the number of vulnerable code samples is extremely small) is at the core of the phenomenon. To validate this, we conduct a comprehensive empirical study involving nine open-source datasets and two state-of-the-art DL models. The results confirm our conjecture. We also obtain insightful findings on how existing imbalance solutions perform in vulnerability detection. It turns…
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Information and Cyber Security
