Mono: Is Your "Clean" Vulnerability Dataset Really Solvable? Exposing and Trapping Undecidable Patches and Beyond

Zeyu Gao; Junlin Zhou; Bolun Zhang; Yi He; Chao Zhang; Yuxin Cui; Hao Wang

arXiv:2506.03651·cs.CR·June 12, 2025

TL;DR

This paper introduces mono, an LLM-powered framework that improves vulnerability datasets by correcting patch labels, recovering contextual information, and filtering out undecidable patches, which in turn improves vulnerability detection accuracy.

Contribution

mono is a novel framework that leverages large language models to improve vulnerability dataset quality through semantic classification, contextual analysis, and root cause filtering.
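
As a rough illustration of how the three stages named above could compose into a single dataset-cleaning pass, the following Python sketch chains a classifier, a context finder, and a decidability check. It is not taken from the mono repository: the `Patch` type, `clean_dataset`, and the three `toy_*` functions are hypothetical, and the toy heuristics merely stand in for the LLM-backed reasoning the paper describes.

```python
import re
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Patch:
    """Hypothetical record for one patch in a vulnerability dataset."""
    patch_id: str
    diff: str                      # changed lines only, prefixed with + or -
    label: str                     # label as found in the source dataset
    context: Tuple[str, ...] = ()  # extra functions needed to understand the fix


def clean_dataset(
    patches: Iterable[Patch],
    classify: Callable[[Patch], str],
    find_context: Callable[[Patch], List[str]],
    is_decidable: Callable[[Patch], bool],
) -> Tuple[List[Patch], List[Patch]]:
    """Relabel each patch, attach context, then split by decidability."""
    kept: List[Patch] = []
    dropped: List[Patch] = []
    for patch in patches:
        patch = replace(patch, label=classify(patch))
        patch = replace(patch, context=tuple(find_context(patch)))
        (kept if is_decidable(patch) else dropped).append(patch)
    return kept, dropped


def changed_lines(patch: Patch) -> List[str]:
    """Return added/removed lines with the leading +/- stripped."""
    return [
        line[1:].strip()
        for line in patch.diff.splitlines()
        if line.startswith(("+", "-"))
    ]


# Placeholder heuristics (assumptions); an LLM call would replace each one.
C_KEYWORDS = {"if", "for", "while", "switch", "return", "sizeof"}


def toy_classify(patch: Patch) -> str:
    """Relabel comment-only changes as non-security; otherwise keep the label."""
    lines = changed_lines(patch)
    if lines and all(line.startswith("//") for line in lines):
        return "non-security"
    return patch.label


def toy_find_context(patch: Patch) -> List[str]:
    """Collect names of functions called on changed lines."""
    names = set()
    for line in changed_lines(patch):
        names.update(re.findall(r"\b([A-Za-z_]\w*)\s*\(", line))
    return sorted(names - C_KEYWORDS)


def toy_is_decidable(patch: Patch) -> bool:
    """Treat a patch as decidable only if it is a security fix with real changes."""
    return patch.label == "security" and bool(changed_lines(patch))
```

Passing the stages in as callables keeps the pipeline itself independent of how each decision is made, so a model-backed classifier could be swapped in without touching `clean_dataset`.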

Findings

01. Corrects 31.0% of labeling errors in datasets.

02. Recovers 89% of inter-procedural vulnerabilities.

03. Reveals that 16.7% of CVEs have undecidable patches.

Abstract

The quantity and quality of vulnerability datasets are essential for developing deep learning solutions to vulnerability-related tasks. Due to the limited availability of vulnerabilities, a common approach to building such datasets is analyzing security patches in source code. However, existing security patches often suffer from inaccurate labels, insufficient contextual information, and undecidable patches that fail to clearly represent the root causes of vulnerabilities or their fixes. These issues introduce noise into the dataset, which can mislead detection models and undermine their effectiveness. To address these issues, we present mono, a novel LLM-powered framework that simulates human experts' reasoning process to construct reliable vulnerability datasets. mono introduces three key components to improve security patch datasets: (i) semantic-aware patch classification for…

Code & Models

Repositories

vul337/mono (Official)

Taxonomy

Topics: Advanced Malware Detection Techniques · Information and Cyber Security · Security and Verification in Computing