Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection
Niklas Risse, Jing Liu, Marcel Böhme

TL;DR
This paper critically examines current machine learning benchmarks for vulnerability detection. It shows that high scores often stem from spurious correlations in the datasets rather than from reasoning about the context in which a function is called, which calls the validity of these benchmarks into question.
Contribution
The paper highlights the limitations of existing ML4VD datasets and benchmarks and argues for more meaningful evaluation methods and alternative problem formulations.
Findings
High scores achieved with minimal context due to dataset biases
Whether a function is truly vulnerable often depends on its calling context
Current benchmarks can be exploited without genuine vulnerability detection
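The context-dependence finding can be illustrated with a small sketch. This example is not taken from the paper: the functions `copy_into`, `safe_caller`, and `unsafe_caller` are hypothetical, and Python's `IndexError` is used only as a stand-in for the out-of-bounds write that the same pattern would cause in a memory-unsafe language.

```python
def copy_into(dest: bytearray, src: bytes) -> None:
    """Copy src into the start of dest. The function performs no length check."""
    for i, byte in enumerate(src):
        # Raises IndexError when src is longer than dest; in a memory-unsafe
        # language the equivalent write would land outside the buffer.
        dest[i] = byte


def safe_caller(src: bytes) -> bytearray:
    """Hypothetical caller that validates the length before copying."""
    dest = bytearray(8)
    if len(src) > len(dest):
        raise ValueError("input too long")
    copy_into(dest, src)
    return dest


def unsafe_caller(src: bytes) -> bytearray:
    """Hypothetical caller that forwards unchecked input to copy_into."""
    dest = bytearray(8)
    copy_into(dest, src)
    return dest
```

A function-level classifier shown only `copy_into` has no way to tell which of the two callers exists in the program, so any label it assigns to that function in isolation is a guess about the missing context.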
Abstract
According to our survey of machine learning for vulnerability detection (ML4VD), 9 in every 10 papers published in the past five years define ML4VD as a function-level binary classification problem: Given a function, does it contain a security flaw? From our experience as security researchers, faced with deciding whether a given function makes the program vulnerable to attacks, we would often first want to understand the context in which this function is called. In this paper, we study how often this decision can really be made without further context and study both vulnerable and non-vulnerable functions in the most popular ML4VD datasets. We call a function "vulnerable" if it was involved in a patch of an actual security flaw and confirmed to cause the program's vulnerability. It is "non-vulnerable" otherwise. We find that in almost all cases this decision cannot be made without…
Taxonomy
Topics: Adversarial Robustness in Machine Learning
