Mono: Is Your "Clean" Vulnerability Dataset Really Solvable? Exposing and Trapping Undecidable Patches and Beyond
Zeyu Gao, Junlin Zhou, Bolun Zhang, Yi He, Chao Zhang, Yuxin Cui, Hao Wang

TL;DR
This paper introduces mono, an LLM-powered framework that improves vulnerability datasets by labeling patches accurately, analyzing their context, and filtering out undecidable patches, which in turn improves vulnerability detection accuracy.
Contribution
mono is a novel framework that leverages large language models to improve vulnerability dataset quality through semantic classification, contextual analysis, and root cause filtering.
Findings
Corrects 31.0% of labeling errors in datasets.
Recovers 89% of inter-procedural vulnerabilities.
Reveals 16.7% of CVEs have undecidable patches.
Abstract
The quantity and quality of vulnerability datasets are essential for developing deep learning solutions to vulnerability-related tasks. Due to the limited availability of vulnerabilities, a common approach to building such datasets is analyzing security patches in source code. However, existing security patches often suffer from inaccurate labels, insufficient contextual information, and undecidable patches that fail to clearly represent the root causes of vulnerabilities or their fixes. These issues introduce noise into the dataset, which can mislead detection models and undermine their effectiveness. To address these issues, we present mono, a novel LLM-powered framework that simulates human experts' reasoning process to construct reliable vulnerability datasets. mono introduces three key components to improve security patch datasets: (i) semantic-aware patch classification for…
Taxonomy
Topics: Advanced Malware Detection Techniques · Information and Cyber Security · Security and Verification in Computing
