TL;DR
ICVul is a C/C++ vulnerability dataset built for label quality, with comprehensive metadata and Vulnerability-Contributing Commits (VCCs), designed to improve machine learning-based vulnerability detection.
Contribution
The paper introduces ICVul, a new vulnerability dataset with enhanced label quality, metadata, and VCC tracing, addressing data reliability issues in existing datasets.
Findings
ICVul contains over X vulnerabilities with detailed metadata.
The ESC (Eliminate Suspicious Commit) technique is applied to make the dataset's labels more reliable.
ICVul is publicly available for research use.
Abstract
Machine learning-based software vulnerability detection requires high-quality datasets, which are essential for training effective models. To address challenges related to data label quality, diversity, and comprehensiveness, we constructed ICVul, a dataset emphasizing data quality and enriched with comprehensive metadata, including Vulnerability-Contributing Commits (VCCs). We began by filtering Common Vulnerabilities and Exposures (CVEs) from the NVD, retaining only those linked to GitHub fix commits. We then extracted functions and files, along with relevant metadata, from these commits and used the SZZ algorithm to trace VCCs. To further enhance label reliability, we developed the ESC (Eliminate Suspicious Commit) technique, ensuring credible data labels. The dataset is stored in a relational-like database for improved usability and data integrity. Both ICVul and its construction framework…
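The first and last steps of the abstract (keeping only CVEs linked to GitHub fix commits, then storing the result relationally) can be illustrated with a minimal sketch. This is not the ICVul framework's code: the URL pattern, the function names, the two-table schema, and the sample records are all assumptions made for illustration, and the SZZ and ESC steps are not shown.

```python
import re
import sqlite3

# Assumed shape of a GitHub commit URL: github.com/<owner>/<repo>/commit/<sha>
GITHUB_COMMIT_RE = re.compile(
    r"https?://github\.com/([\w.-]+)/([\w.-]+)/commit/([0-9a-fA-F]{7,40})"
)


def github_fix_commits(reference_urls):
    """Return (owner, repo, sha) tuples for references that point at GitHub commits."""
    commits = []
    for url in reference_urls:
        match = GITHUB_COMMIT_RE.match(url)
        if match:
            commits.append(match.groups())
    return commits


def filter_cves(cve_records):
    """Keep only CVEs that have at least one GitHub commit among their references."""
    kept = {}
    for cve_id, urls in cve_records.items():
        commits = github_fix_commits(urls)
        if commits:
            kept[cve_id] = commits
    return kept


def store(kept, conn):
    """Write the kept CVEs into a hypothetical two-table schema with a foreign key."""
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE cve (cve_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE fix_commit ("
        " sha TEXT, owner TEXT, repo TEXT,"
        " cve_id TEXT NOT NULL REFERENCES cve(cve_id),"
        " PRIMARY KEY (sha, cve_id))"
    )
    for cve_id, commits in kept.items():
        conn.execute("INSERT INTO cve VALUES (?)", (cve_id,))
        conn.executemany(
            "INSERT INTO fix_commit VALUES (?, ?, ?, ?)",
            [(sha, owner, repo, cve_id) for owner, repo, sha in commits],
        )
    conn.commit()


# Made-up sample records (not real CVEs): CVE id -> list of reference URLs.
SAMPLE = {
    "CVE-0000-0001": [
        "https://github.com/example/project/commit/"
        "0123456789abcdef0123456789abcdef01234567"
    ],
    "CVE-0000-0002": ["https://example.org/advisory/42"],
}

kept = filter_cves(SAMPLE)
conn = sqlite3.connect(":memory:")
store(kept, conn)
```

In this sketch the foreign key from `fix_commit` to `cve` stands in for the data-integrity benefit the abstract attributes to relational storage: a commit row cannot exist without its CVE row.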
