JavaVFC: Java Vulnerability Fixing Commits from Open-source Software
Tan Bui, Yan Naing Tun, Yiran Cheng, Ivana Clairine Irsan, Ting Zhang, Hong Jin Kang

TL;DR
This paper introduces a comprehensive, manually verified dataset of Java vulnerability-fixing commits from open-source projects, facilitating research in vulnerability analysis, detection, and repair.
Contribution
The paper presents a large, high-quality dataset of Java vulnerability-fixing commits derived from GitHub, with a rigorous labeling process and two dataset variants for research use.
Findings
The dataset includes 784 manually verified VFCs (JavaVFC) and 16,837 automatically identified VFCs (JavaVFC-extended).
The keyword filtering approach achieved a precision of 0.7.
The dataset is publicly available for research in vulnerability analysis.
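As a rough illustration of how a precision figure such as 0.7 can arise, the sketch below computes the share of keyword-matched candidate commits that annotators confirm as true VFCs. The majority-vote rule, the function names, and the toy votes are assumptions made for this example; the summary does not state how the three annotators' labels were combined.

```python
def majority_label(votes):
    """Return True when more than half of the annotators marked the commit as a VFC.

    Majority voting is an assumption for this sketch, not the paper's stated rule.
    """
    return sum(votes) > len(votes) / 2


def precision(candidate_votes):
    """Fraction of keyword-matched candidates that annotators confirmed as VFCs."""
    if not candidate_votes:
        return 0.0
    confirmed = sum(1 for votes in candidate_votes if majority_label(votes))
    return confirmed / len(candidate_votes)


# Toy data (not from the paper): 10 candidate commits, 3 annotator votes each.
toy_votes = (
    [[True, True, True]] * 5        # unanimously confirmed
    + [[True, True, False]] * 2     # confirmed by majority
    + [[False, False, True]] * 2    # rejected by majority
    + [[False, False, False]]       # unanimously rejected
)
toy_precision = precision(toy_votes)
```

Here 7 of the 10 toy candidates are confirmed, so the toy precision matches the 0.7 figure only by construction.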
Abstract
We present a comprehensive dataset of Java vulnerability-fixing commits (VFCs) to advance research in Java vulnerability analysis. Our dataset, derived from thousands of open-source Java projects on GitHub, comprises two variants: JavaVFC and JavaVFC-extended. The dataset was constructed through a rigorous process involving heuristic rules and multiple rounds of manual labeling. We initially used keywords to filter candidate VFCs based on commit messages, then refined this keyword set through iterative manual labeling. The final labeling round achieved a precision score of 0.7 among three annotators. We applied the refined keyword set to 34,321 open-source Java repositories with over 50 GitHub stars, resulting in JavaVFC with 784 manually verified VFCs and JavaVFC-extended with 16,837 automatically identified VFCs. Both variants are presented in a standardized JSONL format for easy…
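The keyword-filtering step and the JSONL output described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the keyword list, the record fields (`repo`, `sha`, `message`), and the sample commits are hypothetical and are not the paper's refined keyword set or schema.

```python
import json
import re

# Hypothetical keywords for illustration only; the paper's refined set is not listed here.
HYPOTHETICAL_KEYWORDS = ["vulnerab", "security fix", "xss", "sql injection"]
PATTERN = re.compile(
    "|".join(re.escape(keyword) for keyword in HYPOTHETICAL_KEYWORDS),
    re.IGNORECASE,
)


def is_candidate_vfc(message):
    """Flag a commit as a candidate VFC when its message contains any keyword."""
    return PATTERN.search(message) is not None


def to_jsonl(commits):
    """Serialize the candidate commits as JSONL: one JSON object per line."""
    return "\n".join(
        json.dumps(commit, sort_keys=True)
        for commit in commits
        if is_candidate_vfc(commit["message"])
    )


# Made-up commits; field names are assumptions, not the dataset's actual schema.
sample_commits = [
    {"repo": "example/app", "sha": "a1b2c3", "message": "Fix XSS in comment rendering"},
    {"repo": "example/app", "sha": "d4e5f6", "message": "Update README"},
    {"repo": "example/lib", "sha": "0a1b2c", "message": "Patch SQL injection in query builder"},
]
jsonl_text = to_jsonl(sample_commits)
```

Matching on commit messages alone is cheap to run across many repositories, but it yields only candidates, which is why the abstract pairs it with manual labeling.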
Taxonomy
Topics: Software Reliability and Analysis Research
