When Machine Learning Meets Vulnerability Discovery: Challenges and Lessons Learned
Sima Arasteh, Christophe Hauser

TL;DR
This paper discusses the challenges of applying machine learning to software vulnerability discovery, highlighting issues with dataset transparency, model choices, and evaluation, while sharing insights from two related research tools.
Contribution
The paper identifies key challenges in ML-based vulnerability detection and draws on lessons from two related research tools to guide future research in the field.
Findings
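Researchers often omit concrete dataset statistics, such as the number of samples per vulnerability type, which hampers evaluation.
Many methods train on semantically similar functions rather than directly on vulnerable programs, which leaves the suitability of current training datasets uncertain.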
Model choice and granularity significantly impact effectiveness.
Abstract
In recent years, machine learning has demonstrated impressive results in various fields, including software vulnerability detection. Nonetheless, using machine learning to identify software vulnerabilities presents new challenges, especially regarding the scale of data involved, which was not a factor in traditional methods. Consequently, in spite of the rise of new machine-learning-based approaches in that space, important shortcomings persist regarding their evaluation. First, researchers often fail to provide concrete statistics about their training datasets, such as the number of samples for each type of vulnerability. Moreover, many methods rely on training with semantically similar functions rather than directly on vulnerable programs. This leads to uncertainty about the suitability of the datasets currently used for training. Secondly, the choice of a model and the level of…
Taxonomy
Topics: Software Engineering Research · Information and Cyber Security · Web Application Security Vulnerabilities
