LLM-based Vulnerable Code Augmentation: Generate or Refactor?
Dyna Soumhane Ouchebara, Stéphane Dupont

TL;DR
This paper explores using large language models to augment vulnerable code datasets by generating new samples or refactoring existing ones, improving vulnerability classification performance.
Contribution
It compares controlled generation and refactoring methods for augmenting vulnerable code, demonstrating a hybrid approach's effectiveness.
Findings
Augmentation improves vulnerability classifier performance.
Hybrid generation and refactoring strategy yields best results.
Augmented data quality is reasonable and beneficial.
Abstract
Vulnerability code-bases often suffer from severe imbalance, limiting the effectiveness of Deep Learning-based vulnerability classifiers. Data Augmentation could help solve this by mitigating the scarcity of under-represented vulnerability types. In this context, we investigate LLM-based augmentation for vulnerable functions, comparing controlled generation of new vulnerable samples with semantics-preserving refactoring of existing ones. Using Qwen2.5-Coder to produce augmented data and CodeBERT as a classifier on the SVEN dataset, we find that our approaches are indeed effective in enriching vulnerable code-bases through a simple process and with reasonable quality, and that a hybrid strategy best boosts vulnerability classifiers' performance. The code repository is available at https://github.com/DynaSoumhaneOuchebara/LLM-based-code-augmentation-Generate-or-Refactor-
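As a rough illustration of the hybrid idea described in the abstract, the sketch below tops up under-represented vulnerability classes with a mix of generated and refactored samples. It is not the authors' implementation: the function names (`augmentation_plan`, `augment`), the `generate_ratio` parameter, the choice to balance every class up to the size of the largest one, and the CWE labels in the demo are all assumptions made for this example. The `fake_generate` and `fake_refactor` stubs stand in for LLM calls, which are not shown here.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, str]  # (function source code, vulnerability label)


def augmentation_plan(labels: List[str], generate_ratio: float = 0.5) -> Dict[str, Dict[str, int]]:
    """Decide, per class, how many samples to add and from which source.

    Assumption: every class is topped up to the size of the largest class.
    """
    counts = Counter(labels)
    target = max(counts.values())
    plan = {}
    for label, n in counts.items():
        deficit = target - n
        n_generate = int(deficit * generate_ratio)  # rounds down
        plan[label] = {"generate": n_generate, "refactor": deficit - n_generate}
    return plan


def augment(
    samples: List[Sample],
    generate: Callable[[str], str],   # label -> new vulnerable function
    refactor: Callable[[str], str],   # existing function -> rewritten function
    generate_ratio: float = 0.5,
) -> List[Sample]:
    """Return the original samples plus hybrid augmented ones."""
    plan = augmentation_plan([label for _, label in samples], generate_ratio)
    by_label: Dict[str, List[str]] = {}
    for code, label in samples:
        by_label.setdefault(label, []).append(code)

    augmented = list(samples)
    for label, budget in plan.items():
        for _ in range(budget["generate"]):
            augmented.append((generate(label), label))
        for i in range(budget["refactor"]):
            # Cycle through the existing samples of this class as seeds.
            seed = by_label[label][i % len(by_label[label])]
            augmented.append((refactor(seed), label))
    return augmented


# Hypothetical stand-ins for the LLM calls (no model is invoked here).
def fake_generate(label: str) -> str:
    return f"/* generated sample for {label} */"


def fake_refactor(code: str) -> str:
    return code + " /* refactored */"


# Toy imbalanced dataset; label names are illustrative only.
samples = [(f"void f{i}() {{}}", "CWE-79") for i in range(4)] + [("void g() {}", "CWE-125")]
balanced = augment(samples, fake_generate, fake_refactor)
print(Counter(label for _, label in balanced))
```

In a real pipeline, the two stubs would be replaced by prompts to a code LLM, and the augmented set would be added only to the training split so that evaluation data stays untouched.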
Taxonomy
Topics: Web Application Security Vulnerabilities · Advanced Malware Detection Techniques · Security and Verification in Computing
