How Well Do Large Language Models Serve as End-to-End Secure Code Agents for Python?

Jianian Gong; Nachuan Duan; Ziheng Tao; Zhaohui Gong; Yuan Yuan; Minlie Huang

arXiv:2408.10495 · cs.SE · June 16, 2025

TL;DR

This paper systematically evaluates GPT-3.5 and GPT-4's ability to generate and repair secure Python code, revealing significant vulnerabilities and proposing an iterative repair tool that substantially improves code security.

Contribution

It provides a comprehensive analysis of LLMs' security awareness and introduces an iterative repair method with semantic analysis to enhance code safety.

Findings

01

Over 75% of the generated code is vulnerable on the SecurityEval benchmark.

02

GPT-3.5 and GPT-4 struggle to identify their own vulnerabilities.

03

The proposed iterative repair tool improves success rates to 65.9%-85.5%.
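
The general shape of an iterative repair loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's tool: `iterative_repair`, `RepairResult`, `toy_scan`, and `toy_repair` are hypothetical names, the scanner and repairer are passed in as plain callables, and the toy stand-ins use simple string matching in place of a real analyzer or an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RepairResult:
    code: str    # final version of the code
    clean: bool  # True if the last scan reported no findings
    rounds: int  # number of repair attempts made


def iterative_repair(
    code: str,
    scan: Callable[[str], List[str]],
    repair: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> RepairResult:
    """Scan, feed findings back to the repairer, and repeat until clean
    or until max_rounds repair attempts have been made."""
    for rounds_used in range(max_rounds):
        findings = scan(code)
        if not findings:
            return RepairResult(code, True, rounds_used)
        code = repair(code, findings)
    # Budget exhausted: report whether the last repair happened to succeed.
    return RepairResult(code, not scan(code), max_rounds)


# Toy stand-ins (assumptions for illustration only). A real setup would call
# a static analyzer in `scan` and a language model in `repair`.
def toy_scan(code: str) -> List[str]:
    return ["shell=True passed to subprocess"] if "shell=True" in code else []


def toy_repair(code: str, findings: List[str]) -> str:
    # Naive textual fix; a real repair would also adjust the command argument.
    return code.replace("shell=True", "shell=False")


result = iterative_repair("subprocess.run(cmd, shell=True)", toy_scan, toy_repair)
print(result)
```

Passing the scanner and repairer as callables keeps the loop independent of any particular analyzer or model, and the `max_rounds` cap prevents an unproductive repairer from looping forever.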

Abstract

The rapid advancement of large language models (LLMs) such as GPT-4 has revolutionized the landscape of software engineering, positioning these models at the core of modern development practices. As we anticipate these models evolving into the primary and trustworthy tools used in software development, ensuring the security of the code they produce becomes paramount. How well can LLMs serve as end-to-end secure code producers? This paper presents a systematic investigation into LLMs' inherent potential to generate code with fewer vulnerabilities. Specifically, we studied the capability of GPT-3.5 and GPT-4 to identify and repair vulnerabilities in code generated by four popular LLMs, including themselves (GPT-3.5, GPT-4, Code Llama, and CodeGeeX2). By manually or automatically reviewing 4,900 pieces of code, our study reveals that: (1) large language models lack awareness of…

Taxonomy

Topics: Advanced Malware Detection Techniques
