Toward Scalable Automated Repository-Level Datasets for Software Vulnerability Detection

Amine Lbath

arXiv:2603.17974·cs.SE·March 19, 2026

Open Access

TL;DR

This paper proposes an automated method for generating large-scale, realistic, repository-level datasets for software vulnerability detection. It injects vulnerabilities into real repositories and synthesizes reproducible exploits, producing precisely labeled data for training and evaluating detectors.

Contribution

The paper proposes an automated benchmark generator that scales vulnerability dataset creation, and it explores adversarial co-evolution between injection and detection agents to improve detection robustness.

Findings

01. Automated injection of vulnerabilities into real repositories.

02. Generation of reproducible proof-of-vulnerability exploits.

03. Enhanced dataset scale and realism for vulnerability detection.

Abstract

Software vulnerabilities continue to grow in volume and remain difficult to detect in practice. Although learning-based vulnerability detection has progressed, existing benchmarks are largely function-centric and fail to capture realistic, executable, interprocedural settings. Recent repo-level security benchmarks demonstrate the importance of realistic environments, but their manual curation limits scale. This doctoral research proposes an automated benchmark generator that injects realistic vulnerabilities into real-world repositories and synthesizes reproducible proof-of-vulnerability (PoV) exploits, enabling precisely labeled datasets for training and evaluating repo-level vulnerability detection agents. We further investigate an adversarial co-evolution loop between injection and detection agents to improve robustness under realistic constraints.
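
The abstract does not say how a generated sample is accepted into the dataset. As a minimal sketch of what "precisely labeled" could mean in practice, the Python below assumes a differential check: a sample is kept only if its PoV triggers on the injected version of the repository and does not trigger on the original. The record fields, the names `InjectedSample`, `is_reproducible`, and `build_dataset`, and the acceptance rule itself are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class InjectedSample:
    """Hypothetical record for one injected vulnerability (assumed schema)."""
    repo: str                      # repository identifier
    file_path: str                 # file modified by the injection
    weakness_label: str            # label for the injected weakness class
    pov_triggers_injected: bool    # PoV outcome on the injected version
    pov_triggers_original: bool    # PoV outcome on the untouched version


def is_reproducible(sample: InjectedSample) -> bool:
    # Assumed acceptance rule: the PoV must distinguish the two versions.
    return sample.pov_triggers_injected and not sample.pov_triggers_original


def build_dataset(candidates: Iterable[InjectedSample]) -> List[InjectedSample]:
    # Keep only the candidates whose label is backed by a working PoV.
    return [s for s in candidates if is_reproducible(s)]


candidates = [
    InjectedSample("example/repo-a", "src/parse.c", "out-of-bounds write", True, False),
    InjectedSample("example/repo-b", "lib/auth.py", "injection", False, False),
    InjectedSample("example/repo-c", "core/io.c", "use after free", True, True),
]
dataset = build_dataset(candidates)
```

Under this assumed rule, the second candidate is dropped because its PoV never triggers, and the third is dropped because the PoV also triggers on the original code, so the failure cannot be attributed to the injection.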

Taxonomy

Topics: Web Application Security Vulnerabilities · Software Engineering Research · Information and Cyber Security