TL;DR
VulScribeR leverages large language models with retrieval-augmented generation to effectively augment vulnerable code datasets, significantly improving vulnerability detection performance across multiple datasets and models.
Contribution
The paper introduces VulScribeR, a novel LLM-based data augmentation method that uses prompt templates and three strategies to generate vulnerable code and outperforms state-of-the-art methods.
Findings
VulScribeR improves F1-score by up to 54% over SOTA methods.
The approach generates large-scale vulnerable datasets cost-effectively.
VulScribeR is effective across multiple datasets and vulnerability types.
Abstract
Detecting vulnerabilities is vital for software security, yet deep learning-based vulnerability detectors (DLVD) face a data shortage, which limits their effectiveness. Data augmentation can potentially alleviate the data shortage, but augmenting vulnerable code is challenging and requires a generative solution that maintains vulnerability. Previous works have only focused on generating samples that contain single statements or specific types of vulnerabilities. Recently, large language models (LLMs) have been used to solve various code generation and comprehension tasks with inspiring results, especially when fused with retrieval augmented generation (RAG). Therefore, we propose VulScribeR, a novel LLM-based solution that leverages carefully curated prompt templates to augment vulnerable datasets. More specifically, we explore three strategies to augment both single and multi-statement…
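As a rough illustration of the retrieval-plus-prompt-template idea the abstract describes, the sketch below picks the corpus function most similar to a vulnerable sample and fills a prompt with both. Everything in it is an assumption made for illustration: the template wording, the token-overlap (Jaccard) retriever, and the names `PROMPT_TEMPLATE`, `retrieve`, and `build_prompt` are hypothetical and are not taken from the paper, and no LLM call is made.

```python
import re

# Hypothetical prompt template; the paper's actual templates are not shown here.
PROMPT_TEMPLATE = (
    "Here is a vulnerable function:\n{vulnerable}\n\n"
    "Here is a related function retrieved from a code corpus:\n{retrieved}\n\n"
    "Write a new function that keeps the vulnerable logic of the first "
    "function while drawing on the second for context."
)


def tokenize(code):
    """Split code into a set of identifier-like tokens."""
    return set(re.findall(r"[A-Za-z_]\w*", code))


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query, corpus):
    """Return the corpus entry most lexically similar to the query.

    Assumption: simple token overlap stands in for whatever retriever
    a real RAG pipeline would use.
    """
    query_tokens = tokenize(query)
    return max(corpus, key=lambda doc: jaccard(query_tokens, tokenize(doc)))


def build_prompt(vulnerable, corpus):
    """Fill the template with the vulnerable sample and its nearest neighbour."""
    return PROMPT_TEMPLATE.format(
        vulnerable=vulnerable,
        retrieved=retrieve(vulnerable, corpus),
    )


# Toy data, invented for this example.
vulnerable = "void copy(char *dst, char *src) { strcpy(dst, src); }"
corpus = [
    "int add(int a, int b) { return a + b; }",
    "void copy_n(char *dst, char *src, int n) { strncpy(dst, src, n); }",
]
prompt = build_prompt(vulnerable, corpus)
print(prompt)
```

In a full pipeline the resulting prompt would be sent to an LLM and the generated function added to the training set; that step, and any filtering of the output, is left out of this sketch.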
Taxonomy
Methods: Byte Pair Encoding · Softmax · Dense Connections · Dropout · Linear Layer · Attention Dropout · Residual Connection · Linear Warmup With Linear Decay · BART
