VulDetectBench: Evaluating the Deep Capability of Vulnerability Detection with Large Language Models
Yu Liu, Lang Gao, Mingxin Yang, Yu Xie, Ping Chen, Xiaojin Zhang, Wei Chen

TL;DR
VulDetectBench is a new benchmark designed to evaluate the vulnerability detection capabilities of large language models across multiple tasks, revealing strengths in basic detection but weaknesses in detailed vulnerability analysis.
Contribution
The paper introduces VulDetectBench, a comprehensive benchmark for assessing LLMs' ability to detect, classify, and locate code vulnerabilities, filling a gap in specialized vulnerability research.
Findings
Models achieve over 80% accuracy in vulnerability identification and classification.
Models perform poorly (<30%) on detailed vulnerability analysis tasks.
VulDetectBench provides a standardized evaluation framework for future improvements.
Abstract
The training corpora of Large Language Models (LLMs) contain large amounts of program code, which greatly improves the models' code comprehension and generation capabilities. However, sound and comprehensive research on detecting program vulnerabilities, a more specific code-related task, and on evaluating the performance of LLMs in this more specialized scenario is still lacking. To address common challenges in vulnerability analysis, our study introduces a new benchmark, VulDetectBench, specifically designed to assess the vulnerability detection capabilities of LLMs. The benchmark comprehensively evaluates LLMs' ability to identify, classify, and locate vulnerabilities through five tasks of increasing difficulty. We evaluate the performance of 17 models (both open- and closed-source) and find that while existing models can achieve over 80% accuracy on tasks related to vulnerability…
Taxonomy
Topics: Web Application Security Vulnerabilities
