Data and Context Matter: Towards Generalizing AI-based Software Vulnerability Detection
Rijha Safdar, Danyail Mateen, Syed Taha Ali, M. Umer Ashfaq, Wajahat Hussain

TL;DR
This paper shows that high-quality, diverse training data and a suitable choice of model architecture significantly improve how well AI-based vulnerability detectors generalize to unseen codebases.
Contribution
Introduces VulGate, a high-quality dataset, and benchmarks encoder-only and decoder-only models to improve the generalization of vulnerability detection.
Findings
Encoder-based models outperform others in accuracy and generalization.
The VulGate dataset improves detection performance and robustness.
The approach achieves a 6.8% recall improvement on the BigVul dataset.
Abstract
AI-based solutions demonstrate remarkable results in identifying vulnerabilities in software, but research has consistently found that this performance does not generalize to unseen codebases. In this paper, we specifically investigate the impact of model architecture, parameter configuration, and quality of training data on the ability of these systems to generalize. For this purpose, we introduce VulGate, a high-quality, state-of-the-art dataset that mitigates the shortcomings of prior datasets by removing mislabeled and duplicate samples, adding newly reported vulnerabilities, incorporating additional metadata, integrating hard samples, and including dedicated test sets. We undertake a series of experiments to demonstrate that improved dataset diversity and quality substantially enhance vulnerability detection. We also introduce and benchmark multiple encoder-only and decoder-only models.…
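To make the "removing mislabeled and duplicate samples" step concrete, here is a minimal Python sketch of one way such cleaning could work. It is not the VulGate pipeline, which the abstract does not describe. The functions `normalize` and `clean`, the `(code, label)` sample format, and the rule of dropping groups with conflicting labels are all assumptions made for illustration.

```python
import hashlib
import re


def normalize(code: str) -> str:
    """Crude normalization (assumed): strip C-style comments, collapse whitespace.

    Note: this does not handle comment markers inside string literals.
    """
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.DOTALL)  # block comments
    code = re.sub(r"//[^\n]*", " ", code)                    # line comments
    return " ".join(code.split())


def clean(samples):
    """Deduplicate (code, label) pairs by normalized content.

    Hypothetical rule: if identical normalized code appears with more than
    one label, treat the whole group as possibly mislabeled and drop it;
    otherwise keep the first occurrence.
    """
    groups = {}
    for code, label in samples:
        key = hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()
        groups.setdefault(key, []).append((code, label))

    kept = []
    for members in groups.values():
        labels = {label for _, label in members}
        if len(labels) == 1:
            kept.append(members[0])
    return kept


samples = [
    ("int f() { return 1; }", 0),
    ("int  f() {\n return 1; } // dup", 0),   # duplicate after normalization
    ("void g() { }", 1),
    ("void g() { } /* same body */", 0),      # conflicting label
    ("int h() { return 2; }", 1),
]
cleaned = clean(samples)
```

Exact hashing only catches duplicates that become identical after normalization. Catching near-duplicates would need a similarity measure, which this sketch leaves out.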
