An Insight into Security Code Review with LLMs: Capabilities, Obstacles, and Influential Factors
Jiaxin Yu, Peng Liang, Yujia Fu, Amjed Tahir, Mojtaba Shahin, Chong Wang, Yangxiao Cai

TL;DR
This study empirically evaluates seven Large Language Models (LLMs) on security code review, showing that they outperform static analysis tools and analyzing the factors that influence their performance.
Contribution
The paper provides a comprehensive comparison of LLMs and static analysis tools in security defect detection, highlighting the impact of prompts and code characteristics on LLM performance.
Findings
LLMs outperform static analysis tools in security defect detection.
DeepSeek-R1 and GPT-4 are the top performers among evaluated LLMs.
Prompt design and code complexity significantly influence LLM effectiveness.
Abstract
Security code review is a time-consuming and labor-intensive process typically requiring integration with automated security defect detection tools. However, existing security analysis tools struggle with poor generalization, high false positive rates, and coarse detection granularity. Large Language Models (LLMs) have been considered promising candidates for addressing those challenges. In this study, we conducted an empirical study to explore the potential of LLMs in detecting security defects during code review. Specifically, we evaluated the performance of seven LLMs under five different prompts and compared them with state-of-the-art static analysis tools. We also performed linguistic and regression analyses for the two top-performing LLMs to identify quality problems in their responses and factors influencing their performance. Our findings show that: (1) In security code review,…
