Good News for Script Kiddies? Evaluating Large Language Models for Automated Exploit Generation

David Jin; Qian Fu; Yuekang Li

arXiv:2505.01065·cs.CR·May 5, 2025

Open Access

TL;DR

This study systematically evaluates large language models' ability to generate exploits. Some models, notably GPT-4 and GPT-4o, are highly cooperative, but no model successfully generates an exploit for the refactored security labs.

Contribution

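Introduces a benchmark of five refactored software security labs, built to mitigate dataset bias, and an LLM-based attacker that systematically prompts LLMs for exploits, in order to assess their effectiveness in automated exploit generation.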

Findings

01

GPT-4 and GPT-4o show high cooperativeness, comparable to uncensored models

02

Llama3 is the most resistant to exploit-generation requests

03

No model successfully exploits the refactored labs, though GPT-4o comes closest with minimal errors

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in code-related tasks, raising concerns about their potential for automated exploit generation (AEG). This paper presents the first systematic study on LLMs' effectiveness in AEG, evaluating both their cooperativeness and technical proficiency. To mitigate dataset bias, we introduce a benchmark with refactored versions of five software security labs. Additionally, we design an LLM-based attacker to systematically prompt LLMs for exploit generation. Our experiments reveal that GPT-4 and GPT-4o exhibit high cooperativeness, comparable to uncensored models, while Llama3 is the most resistant. However, no model successfully generates exploits for refactored labs, though GPT-4o's minimal errors highlight the potential for LLM-driven AEG advancements.

Taxonomy

Topics: Advanced Malware Detection Techniques · Software Reliability and Analysis Research · Web Application Security Vulnerabilities

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Softmax · Absolute Position Encodings