Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection

Niklas Risse; Jing Liu; Marcel Böhme

arXiv:2408.12986 · cs.CR · April 24, 2025

TL;DR

This paper critically examines whether current machine learning benchmarks for vulnerability detection are valid. It finds that high scores often result from spurious correlations rather than contextual understanding, which calls the validity of those benchmarks into question.

Contribution

The paper highlights the limitations of existing ML4VD datasets and benchmarks and argues for more meaningful evaluation methods and alternative problem formulations.

Findings

01. High scores can be achieved with minimal context because of dataset biases.
02. Whether a function is actually vulnerable often depends on its calling context.
03. Current benchmarks can be exploited without genuinely detecting vulnerabilities.

Abstract

According to our survey of machine learning for vulnerability detection (ML4VD), 9 in every 10 papers published in the past five years define ML4VD as a function-level binary classification problem: Given a function, does it contain a security flaw? From our experience as security researchers, faced with deciding whether a given function makes the program vulnerable to attacks, we would often first want to understand the context in which this function is called. In this paper, we study how often this decision can really be made without further context and study both vulnerable and non-vulnerable functions in the most popular ML4VD datasets. We call a function "vulnerable" if it was involved in a patch of an actual security flaw and confirmed to cause the program's vulnerability. It is "non-vulnerable" otherwise. We find that in almost all cases this decision cannot be made without…
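The function-level formulation described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `Sample` record, the `caller_validates_input` flag standing in for calling context, the keyword-matching `function_level_classifier`, and the C snippet in `FUNC`. The point it tries to show is that a classifier which sees only the function text must give the same answer for two samples whose true labels differ because of their callers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical benchmark entry: function text plus context the classifier never sees."""
    function_source: str
    caller_validates_input: bool  # assumed stand-in for "calling context"


def function_level_classifier(function_source: str) -> bool:
    """Toy stand-in for an ML4VD model: flags any function that calls strcpy."""
    return "strcpy(" in function_source


def ground_truth(sample: Sample) -> bool:
    """In this toy setup the function is exploitable only if no caller checks the input."""
    return not sample.caller_validates_input


# Illustrative C function (hypothetical, not from any dataset).
FUNC = "void copy_name(char *dst, const char *src) { strcpy(dst, src); }"

# Identical function text, two different calling contexts.
samples = [Sample(FUNC, True), Sample(FUNC, False)]

predictions = [function_level_classifier(s.function_source) for s in samples]
labels = [ground_truth(s) for s in samples]
accuracy = sum(p == l for p, l in zip(predictions, labels)) / len(samples)

print(predictions, labels, accuracy)
```

Because both samples share the same `function_source`, any classifier with this signature is right on at most one of them, whatever its internals. A context-aware formulation would need the caller information as part of its input.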

Taxonomy

Topics: Adversarial Robustness in Machine Learning