Can LLMs Find Bugs in Code? An Evaluation from Beginner Errors to Security Vulnerabilities in Python and C++
Akshay Mhatre, Noujoud Nader, Patrick Diehl, Deepti Gupta

TL;DR
This paper empirically evaluates how well LLMs such as ChatGPT-4, Claude 3, and LLaMA 4 detect bugs and vulnerabilities in Python and C++ code. It finds that they handle simple, well-scoped issues well but struggle with complex security flaws.
Contribution
It introduces a comprehensive benchmark and a novel multi-stage prompting protocol to systematically assess LLMs' bug detection capabilities in real-world code scenarios.
Findings
LLMs excel at identifying syntactic and semantic errors in well-scoped code.
Performance drops significantly on complex security vulnerabilities and large-scale production code.
ChatGPT-4 and Claude 3 outperform LLaMA 4 in nuanced contextual analysis.
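To make the contrast in these findings concrete, here is a small illustrative sketch of the two ends of the spectrum: a well-scoped beginner error and a classic security flaw. These snippets are not taken from the paper's benchmark. The function names and the `users` table are hypothetical and chosen only for illustration.

```python
import sqlite3


def average_buggy(values):
    # Beginner-level semantic error: the divisor is off by one,
    # so the result is too small for any non-empty input.
    return sum(values) / (len(values) + 1)


def average_fixed(values):
    # Corrected version: divide by the actual number of elements.
    return sum(values) / len(values)


def find_user_unsafe(conn, name):
    # Classic security flaw (SQL injection): user input is interpolated
    # directly into the query string, so a crafted name can change the query.
    query = f"SELECT id FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Remediation: a parameterized query keeps the input as data.
    return conn.execute(
        "SELECT id FROM users WHERE name = ?", (name,)
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)]
    )
    payload = "x' OR '1'='1"
    print(find_user_unsafe(conn, payload))  # matches every row
    print(find_user_safe(conn, payload))    # matches no row
```

The first bug is visible from the function body alone, while the second only becomes a problem once you consider who controls `name`. That need for wider context is one plausible reason the harder category is more difficult to detect.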
Abstract
Large Language Models (LLMs) such as ChatGPT-4, Claude 3, and LLaMA 4 are increasingly embedded in software development, supporting tasks from code generation to debugging. Yet their real-world effectiveness in detecting diverse software bugs, particularly complex, security-relevant vulnerabilities, remains underexplored. This study presents a systematic, empirical evaluation of these three leading LLMs using a benchmark of foundational programming errors, classic security flaws, and advanced, production-grade bugs in C++ and Python. The dataset integrates real code from SEED Labs, OpenSSL (via the Suresoft GLaDOS database), and PyBugHive, validated through local compilation and testing pipelines. A novel multi-stage, context-aware prompting protocol simulates realistic debugging scenarios, while a graded rubric measures detection accuracy, reasoning depth, and remediation…
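As a rough sketch of how a graded rubric along these lines could be recorded, the snippet below scores one model response on three dimensions named after those in the abstract. The 0 to 2 scale, the class name `RubricScore`, and the helper `mean_total` are assumptions made for illustration. The abstract shown here is truncated and does not specify the actual scale or aggregation the authors use.

```python
from dataclasses import dataclass, astuple
from statistics import mean

MAX_PER_DIMENSION = 2  # assumed scale: 0 = missed, 1 = partial, 2 = full


@dataclass(frozen=True)
class RubricScore:
    """One graded response (hypothetical structure, not the paper's)."""

    detection: int    # was the bug identified?
    reasoning: int    # how well was the cause explained?
    remediation: int  # how good was the proposed fix?

    def __post_init__(self):
        # Reject grades outside the assumed 0..MAX_PER_DIMENSION range.
        for value in astuple(self):
            if not 0 <= value <= MAX_PER_DIMENSION:
                raise ValueError(f"grade out of range: {value}")

    def total(self) -> int:
        # Simple unweighted sum across the three dimensions.
        return self.detection + self.reasoning + self.remediation


def mean_total(scores):
    # Average total grade over a list of graded responses.
    return mean(score.total() for score in scores)


if __name__ == "__main__":
    graded = [RubricScore(2, 2, 1), RubricScore(1, 0, 0)]
    print(mean_total(graded))
```

Scoring the dimensions separately keeps a response that spots a bug but proposes a poor fix distinguishable from one that misses the bug entirely, which a single pass/fail label would hide.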
