Can LLMs Find Bugs in Code? An Evaluation from Beginner Errors to Security Vulnerabilities in Python and C++

Akshay Mhatre; Noujoud Nader; Patrick Diehl; Deepti Gupta

arXiv:2508.16419 · cs.SE · April 28, 2026

TL;DR

This paper empirically evaluates how well LLMs such as ChatGPT-4, Claude 3, and LLaMA 4 detect bugs and vulnerabilities in Python and C++ code, finding that they handle simple issues well but struggle with complex security flaws.

Contribution

It introduces a comprehensive benchmark and a novel multi-stage prompting protocol to systematically assess LLMs' bug detection capabilities in real-world code scenarios.

Findings

01. LLMs excel at identifying syntactic and semantic errors in well-scoped code.

02. Performance drops significantly on complex security vulnerabilities and large-scale production code.

03. ChatGPT-4 and Claude 3 outperform LLaMA 4 in nuanced contextual analysis.
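
To make the two ends of this spectrum concrete, the following minimal Python sketch contrasts a beginner-level semantic error with a classic security flaw. These snippets are illustrative assumptions and are not taken from the paper's benchmark; all function names and the in-memory `users` table are hypothetical.

```python
import sqlite3


def average_buggy(values):
    # Beginner-level semantic bug: the loop stops one element early,
    # so the last value is never added to the total.
    total = 0
    for i in range(len(values) - 1):
        total += values[i]
    return total / len(values)


def average_fixed(values):
    # Corrected version: every element contributes to the sum.
    return sum(values) / len(values)


def make_db():
    # Hypothetical in-memory database used only for this illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_user_unsafe(conn, name):
    # Classic security flaw (SQL injection): user input is interpolated
    # directly into the query string.
    query = f"SELECT name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Remediation: a parameterized query treats the input as data only.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()
```

The first bug is visible within a single well-scoped function, while the second only becomes a problem once an attacker-controlled string such as `x' OR '1'='1` reaches the query, which is the kind of contextual reasoning the findings above describe as harder.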

Abstract

Large Language Models (LLMs) such as ChatGPT-4, Claude 3, and LLaMA 4 are increasingly embedded in software/application development, supporting tasks from code generation to debugging. Yet, their real-world effectiveness in detecting diverse software bugs, particularly complex, security-relevant vulnerabilities, remains underexplored. This study presents a systematic, empirical evaluation of these three leading LLMs using a benchmark of foundational programming errors, classic security flaws, and advanced, production-grade bugs in C++ and Python. The dataset integrates real code from SEED Labs, OpenSSL (via the Suresoft GLaDOS database), and PyBugHive, validated through local compilation and testing pipelines. A novel multi-stage, context-aware prompting protocol simulates realistic debugging scenarios, while a graded rubric measures detection accuracy, reasoning depth, and remediation…
