Explaining the Contributing Factors for Vulnerability Detection in Machine Learning
Esma Mouine, Yan Liu, Lu Xiao, Rick Kazman, Xiao Wang

TL;DR
This paper investigates how different features and machine learning models affect vulnerability detection accuracy across multiple software projects, highlighting effective combinations and transferability limitations.
Contribution
It systematically evaluates the impact of various vulnerability features and models, providing a baseline for future research and practical applications.
Findings
Bag-of-words with random forest improves detection accuracy by 4%.
Transferability of vulnerability signatures across projects is limited.
NLP-based code features enhance vulnerability detection.
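As a rough illustration of the bag-of-words features mentioned in the findings, the sketch below turns code snippets into token-count vectors using only the Python standard library. The tokenizer regex, the function names, and the two C-style snippets are assumptions made for this example and are not taken from the paper; the resulting vectors are the kind of input a classifier such as a random forest could be trained on.

```python
import re
from collections import Counter

# Assumption: identifiers and keywords are the only tokens of interest.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def tokenize(code):
    """Split source code into identifier/keyword tokens."""
    return TOKEN_RE.findall(code)

def build_vocabulary(snippets):
    """Map each distinct token to a column index (sorted for determinism)."""
    tokens = sorted({tok for s in snippets for tok in tokenize(s)})
    return {tok: i for i, tok in enumerate(tokens)}

def vectorize(code, vocabulary):
    """Count occurrences of each vocabulary token; unknown tokens are ignored."""
    counts = Counter(tokenize(code))
    vec = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vec[vocabulary[tok]] = n
    return vec

# Hypothetical snippets, for illustration only.
snippets = [
    "strcpy(buf, input);",
    "strncpy(buf, input, sizeof(buf));",
]
vocab = build_vocabulary(snippets)
vectors = [vectorize(s, vocab) for s in snippets]
```

Each vector has one column per vocabulary token, so snippets that differ only in which API they call (here `strcpy` versus `strncpy`) end up with different feature values.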
Abstract
There is an increasing trend to mine vulnerabilities from software repositories and use machine learning techniques to automatically detect software vulnerabilities. A fundamental but unresolved research question is: how do different factors in the mining and learning process impact the accuracy of identifying vulnerabilities in software projects of varying characteristics? Substantial research has been dedicated to this area, including source code static analysis, software repository mining, and NLP-based machine learning. However, practitioners lack experience regarding the key factors for building a state-of-the-art baseline model. In addition, there is little experience regarding the transferability of vulnerability signatures from project to project. This study investigates how the combination of different vulnerability features and three representative machine learning…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications
Methods: Sparse Evolutionary Training
