Finetuning Large Language Models for Vulnerability Detection

Alexey Shestov; Rodion Levichev; Ravil Mussabayev; Evgeny Maslov; Anton Cheshkov; Pavel Zadorozhny

arXiv:2401.17010 · cs.CR · July 30, 2024 · 5 cites

TL;DR

This paper demonstrates that finetuning large language models like WizardCoder can significantly improve vulnerability detection in source code, especially when addressing class imbalance and optimizing training regimes.

Contribution

The paper finetunes WizardCoder for vulnerability detection, modifies its training procedure to accelerate training, and explores techniques for handling class imbalance, improving detection performance over a CodeBERT-like model.

Findings

01. Improved ROC AUC and F1 scores over baseline models

02. Enhanced training speed without performance loss

03. Effective handling of imbalanced datasets

Abstract

This paper presents the results of finetuning large language models (LLMs) for the task of detecting vulnerabilities in source code. We leverage WizardCoder, a recent improvement of the state-of-the-art LLM StarCoder, and adapt it for vulnerability detection through further finetuning. To accelerate training, we modify WizardCoder's training procedure and investigate optimal training regimes. For the imbalanced dataset, which has many more negative examples than positive ones, we also explore different techniques to improve classification performance. The finetuned WizardCoder model improves ROC AUC and F1 on both balanced and imbalanced vulnerability datasets over a CodeBERT-like model, demonstrating the effectiveness of adapting pretrained LLMs for vulnerability detection in source code. The key contributions are finetuning the state-of-the-art code LLM, WizardCoder,…
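The abstract reports ROC AUC and F1 on a dataset with far more negative than positive examples. As a rough illustration of what those two measures capture, here is a minimal pure-Python sketch. It is not the authors' evaluation code: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for the example.

```python
from typing import Sequence


def f1_score(labels: Sequence[int], preds: Sequence[int]) -> float:
    """F1 for the positive (vulnerable) class from hard 0/1 predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the chance that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced toy set: 2 vulnerable functions out of 8.
labels = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.7, 0.6, 0.9]

# Assumed decision threshold of 0.5 for turning scores into predictions.
preds = [1 if s >= 0.5 else 0 for s in scores]

auc = roc_auc(labels, scores)
f1 = f1_score(labels, preds)
print(f"ROC AUC = {auc:.3f}, F1 = {f1:.3f}")
```

ROC AUC depends only on how the scores rank positives against negatives, so it does not require choosing a threshold, whereas F1 is computed from thresholded predictions and is sensitive to that choice. Reporting both is a common way to evaluate classifiers on imbalanced data.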

Code & Models

Repositories

rmusab/vul-llm-finetune (PyTorch, official)

Taxonomy

Topics: Web Application Security Vulnerabilities · Network Security and Intrusion Detection · Software Reliability and Analysis Research