Automated software vulnerability detection with machine learning
Jacob A. Harer, Louis Y. Kim, Rebecca L. Russell, Onur Ozdemir, Leonard R. Kosta, Akshay Rangamani, Lei H. Hamilton, Gabriel I. Centeno, Jonathan R. Key, Paul M. Ellingwood, Erik Antelman, Alan Mackay, Marc W. McConley, Jeffrey M. Opper, Peter Chin, Tomo Lazovich

TL;DR
This paper introduces a machine learning approach for detecting security vulnerabilities in C and C++ code, leveraging large datasets and combining deep learning with traditional models to improve detection accuracy.
Contribution
It presents a data-driven vulnerability detection method that combines deep neural networks with tree-based models, outperforming traditional approaches.
Findings
Deep models combined with random forests yield the best performance.
Source-based models outperform build artifact-based models.
The best-performing model achieves an AUPRC of 0.49 and an AUROC of 0.87.
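The two metrics in the findings summarize how well a model ranks vulnerable functions above non-vulnerable ones. AUROC is the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. AUPRC summarizes precision across recall levels and is more sensitive to class imbalance. The sketch below computes both on a small made-up example. The labels and scores are invented for illustration and are not data from the paper, and average precision is used as one common estimator of AUPRC, which may differ from the estimator the authors used.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of the precision at the rank of each positive.

    Assumption: used here as an estimator of AUPRC. Assumes no tied scores.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_pos


# Hypothetical toy data: 1 = flagged as vulnerable, 0 = not flagged.
toy_labels = [1, 0, 0, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05]

if __name__ == "__main__":
    print("AUROC:", auroc(toy_labels, toy_scores))
    print("AUPRC (average precision):", average_precision(toy_labels, toy_scores))
```

A random classifier scores about 0.5 on AUROC regardless of class balance, while its AUPRC is roughly the fraction of positive examples. For that reason an AUPRC value should be read against how rare the positive class is in the dataset.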
Abstract
Thousands of security vulnerabilities are discovered in production software each year, either reported publicly to the Common Vulnerabilities and Exposures database or discovered internally in proprietary code. Vulnerabilities often manifest themselves in subtle ways that are not obvious to code reviewers or the developers themselves. With the wealth of open source code available for analysis, there is an opportunity to learn the patterns of bugs that can lead to security vulnerabilities directly from data. In this paper, we present a data-driven approach to vulnerability detection using machine learning, specifically applied to C and C++ programs. We first compile a large dataset of hundreds of thousands of open-source functions labeled with the outputs of a static analyzer. We then compare methods applied directly to source code with methods applied to artifacts extracted from the…
Taxonomy
Topics: Software Engineering Research · Software Reliability and Analysis Research · Advanced Malware Detection Techniques
