Software Vulnerability Prediction in Low-Resource Languages: An Empirical Study of CodeBERT and ChatGPT

Triet H. M. Le; M. Ali Babar; Tung Hoang Thai

arXiv:2404.17110 · cs.SE · April 29, 2024

TL;DR

This study evaluates how data scarcity affects software vulnerability (SV) prediction in low-resource languages and explores ChatGPT as a potential remedy, finding that it substantially outperforms the CodeBERT-based baseline.

Contribution

The study provides the first empirical assessment of ChatGPT's effectiveness for SV prediction in low-resource languages and shows the limitations of data sampling techniques when used with CodeBERT.

Findings

01 CodeBERT's performance drops significantly in low-resource languages (Kotlin, Swift, and Rust) compared with the original C/C++ results.

02 Data sampling techniques do not improve CodeBERT's predictions.

03 ChatGPT improves SV prediction performance by up to 53.5%.

Abstract

Background: Software Vulnerability (SV) prediction in emerging languages is increasingly important to ensure software security in modern systems. However, these languages usually have limited SV data for developing high-performing prediction models. Aims: We conduct an empirical study to evaluate the impact of SV data scarcity in emerging languages on the state-of-the-art SV prediction model and investigate potential solutions to enhance the performance. Method: We train and test the state-of-the-art model based on CodeBERT with and without data sampling techniques for function-level and line-level SV prediction in three low-resource languages - Kotlin, Swift, and Rust. We also assess the effectiveness of ChatGPT for low-resource SV prediction given its recent success in other domains. Results: Compared to the original work in C/C++ with large data, CodeBERT's performance of…

Code & Models

Repositories

lhmtriet/llm4vul (official)

Taxonomy

Topics: Software Reliability and Analysis Research · Software Engineering Research · Web Application Security Vulnerabilities

Methods: CodeBERT