MegaVul: A C/C++ Vulnerability Dataset with Comprehensive Code Representation
Chao Ni, Liyu Shen, Xiaohu Yang, Yan Zhu, Shaohua Wang

TL;DR
MegaVul is a large, comprehensive C/C++ vulnerability dataset derived from CVE data and open-source projects, enriched with multiple code representations, supporting security research and vulnerability detection.
Contribution
The paper introduces a large new C/C++ vulnerability dataset that pairs each code sample with four transformed code representations and covers 169 vulnerability types, intended to support security analysis tools.
Findings
Contains 17,380 vulnerabilities from 992 repositories
Includes four different transformed code representations
Supports vulnerability detection and severity assessment
Abstract
We constructed a new large-scale and comprehensive C/C++ vulnerability dataset named MegaVul by crawling the Common Vulnerabilities and Exposures (CVE) database and CVE-related open-source projects. Specifically, we collected all crawlable descriptive information about the vulnerabilities from the CVE database and extracted all vulnerability-related code changes from 28 Git-based websites. We adopt advanced tools to ensure the integrity of the extracted code and enrich the code with four different transformed representations. In total, MegaVul contains 17,380 vulnerabilities collected from 992 open-source repositories, spanning 169 different vulnerability types disclosed from January 2006 to October 2023. MegaVul can therefore be used for a variety of software security tasks, including detecting vulnerabilities and assessing vulnerability severity. All information is stored in the JSON…
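As a rough illustration of how JSON-stored vulnerability records like these might be used for a detection-oriented analysis, the sketch below tallies vulnerable functions by vulnerability type. The field names (`cve_id`, `cwe_ids`, `is_vul`, `func`), the placeholder CVE identifiers, and the sample records are all assumptions made up for this example; they are not taken from the dataset's documented schema.

```python
import json
from collections import Counter

# Hypothetical records: the field names and values below are assumptions
# for illustration only, not the dataset's documented schema.
SAMPLE_JSON = """
[
  {"cve_id": "CVE-0000-0001", "cwe_ids": ["CWE-119"], "is_vul": true,
   "func": "void f(char *s) { char b[8]; strcpy(b, s); }"},
  {"cve_id": "CVE-0000-0001", "cwe_ids": ["CWE-119"], "is_vul": false,
   "func": "void f(char *s) { char b[8]; strncpy(b, s, 7); b[7] = 0; }"},
  {"cve_id": "CVE-0000-0002", "cwe_ids": ["CWE-416", "CWE-119"], "is_vul": true,
   "func": "void g(int *p) { free(p); *p = 1; }"}
]
"""


def summarize(records):
    """Count records, vulnerable records, and vulnerable records per CWE type."""
    vulnerable = [r for r in records if r["is_vul"]]
    # A record may carry several CWE labels; each label is counted once per record.
    cwe_counts = Counter(cwe for r in vulnerable for cwe in r["cwe_ids"])
    return {
        "total": len(records),
        "vulnerable": len(vulnerable),
        "cwe_counts": dict(cwe_counts),
    }


records = json.loads(SAMPLE_JSON)
summary = summarize(records)
print(summary)
```

Pairing a vulnerable function with its fixed counterpart under the same CVE identifier, as in the first two sample records, is one plausible way to build labeled positive and negative examples for a detector, assuming the real data exposes both versions.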
