Evaluation of ChatGPT Model for Vulnerability Detection
Anton Cheshkov, Pavel Zadorozhny, Rodion Levichev

TL;DR
This paper assesses ChatGPT's effectiveness in detecting code vulnerabilities and finds it performs no better than a dummy classifier, highlighting limitations in applying large language models to security tasks.
Contribution
It provides an empirical evaluation of ChatGPT and GPT-3 for vulnerability detection, revealing their limitations in this specific security domain.
Findings
ChatGPT performs no better than a dummy classifier in vulnerability detection
Large language models may have limited utility for security-specific tasks
Evaluation on real-world datasets highlights current model shortcomings
Abstract
In this technical report, we evaluated the performance of the ChatGPT and GPT-3 models for the task of vulnerability detection in code. Our evaluation was conducted on our real-world dataset, using binary and multi-label classification tasks on CWE vulnerabilities. We decided to evaluate the model because it has shown good performance on other code-based tasks, such as solving programming challenges and understanding code at a high level. However, we found that the ChatGPT model performed no better than a dummy classifier for both binary and multi-label classification tasks for code vulnerability detection.
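To make the comparison with a dummy classifier concrete, here is a minimal sketch of one for the binary task. It assumes a "most frequent label" strategy and F1 as the metric; the report summary above does not say which dummy strategy or metric the authors used, and the labels and model predictions below are made-up toy values, not data from the paper.

```python
from collections import Counter


def most_frequent_baseline(train_labels, n_predictions):
    """Dummy classifier: always predict the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    return [majority_label] * n_predictions


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Toy labels (1 = vulnerable, 0 = not vulnerable); purely illustrative.
train_labels = [1, 1, 1, 0, 0]
test_labels = [1, 0, 1, 1, 0, 0]

# Hypothetical predictions standing in for a model's answers.
model_preds = [1, 1, 0, 1, 1, 0]

baseline_preds = most_frequent_baseline(train_labels, len(test_labels))
baseline_f1 = f1_score(test_labels, baseline_preds)
model_f1 = f1_score(test_labels, model_preds)

print(f"baseline F1: {baseline_f1:.3f}")
print(f"model F1:    {model_f1:.3f}")
```

On this toy data the always-predict-vulnerable baseline scores higher than the hypothetical model, which is the kind of outcome the phrase "no better than a dummy classifier" describes. A model is only useful for detection if it clearly beats such a trivial baseline.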
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Software Reliability and Analysis Research
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Linear Layer · Weight Decay · Linear Warmup With Cosine Annealing · Adam · Dense Connections · Attention Dropout · Dropout
