Large Language Models Versus Static Code Analysis Tools: A Systematic Benchmark for Vulnerability Detection
Damian Gnieciak, Tomasz Szandala

TL;DR
This paper compares large language models and static code analysis tools for vulnerability detection, showing LLMs have higher recall but more false positives, suggesting a hybrid approach for software security testing.
Contribution
It provides the first systematic benchmark comparing LLMs and static analyzers on real-world vulnerabilities, highlighting their strengths and limitations.
Findings
LLMs achieve higher mean F-1 scores (0.797, 0.753 and 0.750) than the static tools.
The higher recall of LLMs enables broader vulnerability detection.
Static tools have fewer false positives and better localization accuracy.
Abstract
Modern software relies on a multitude of automated testing and quality assurance tools to prevent errors, bugs and potential vulnerabilities. This study sets out to provide a head-to-head, quantitative and qualitative evaluation of six automated approaches: three industry-standard rule-based static code-analysis tools (SonarQube, CodeQL and Snyk Code) and three state-of-the-art large language models hosted on the GitHub Models platform (GPT-4.1, Mistral Large and DeepSeek V3). Using a curated suite of ten real-world C# projects that embed 63 vulnerabilities across common categories such as SQL injection, hard-coded secrets and outdated dependencies, we measure classical detection accuracy (precision, recall, F-score), analysis latency, and the developer effort required to vet true positives. The language-based scanners achieve higher mean F-1 scores, 0.797, 0.753 and 0.750, than their…
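The classical accuracy metrics named in the abstract (precision, recall, F-score) can be computed from counts of true positives, false positives and false negatives. The following is a minimal sketch, not the authors' evaluation code: the `Counts` class, the function names and the example counts are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Hypothetical container for one scanner's findings on one project."""
    tp: int  # reported findings that match a known vulnerability
    fp: int  # reported findings that match no known vulnerability
    fn: int  # known vulnerabilities the scanner did not report


def precision(c: Counts) -> float:
    """Fraction of reported findings that are real vulnerabilities."""
    reported = c.tp + c.fp
    return c.tp / reported if reported else 0.0


def recall(c: Counts) -> float:
    """Fraction of known vulnerabilities that were reported."""
    known = c.tp + c.fn
    return c.tp / known if known else 0.0


def f1(c: Counts) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Illustrative counts only (assumed, not taken from the paper).
example = Counts(tp=8, fp=2, fn=2)
print(f"precision={precision(example):.3f} "
      f"recall={recall(example):.3f} "
      f"f1={f1(example):.3f}")
```

A scanner with high recall but many false positives raises `fp` without lowering `fn`, which is the trade-off the findings describe: broader detection at the cost of more vetting effort.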
