ExploitBench: A Capability Ladder Benchmark for LLM Cybersecurity Agents

Seunghyun Lee; David Brumley

arXiv:2605.14153·cs.CR·May 15, 2026

TL;DR

ExploitBench introduces a detailed, capability-graded benchmark for evaluating LLMs in cybersecurity, revealing a significant gap between public and private models in exploiting vulnerabilities.

Contribution

The paper presents a novel, granular benchmark decomposing exploitation into 16 measurable capabilities, enabling precise assessment of LLMs' cybersecurity exploitation skills.

Findings

01. Public models routinely trigger crashes but rarely achieve code execution.

02. Private models show about 50% success in arbitrary code execution.

03. Exploitation of hardened targets remains an emerging frontier for LLMs.

Abstract

Exploitation is not a binary event. It is a ladder of acquiring progressive capabilities, from executing a single buggy line of code to taking full control of the target. However, existing LLM security benchmarks treat a crash as exploitation success. That single binary outcome collapses the hard parts of exploitation: the transition from triggering a bug to constructing reusable primitives and control. We present ExploitBench, a capability-graded benchmark that decomposes exploitation into 16 measurable flags, from coverage and crash through sandbox primitives, arbitrary read/write, control-flow hijack, and arbitrary code execution. Each capability is verified by a deterministic oracle that uses a per-run randomized challenge-response for primitives, differential execution against ground-truth binaries to measure progress, and a signal-handler proof for code execution. We…
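
To illustrate the per-run randomized challenge-response idea mentioned in the abstract, here is a minimal Python sketch for a single "arbitrary read" flag. It is not the paper's oracle: the class ReadChallenge, the function read_primitive, and the simulated memory buffer are all hypothetical names introduced for this example. The assumption is that the oracle plants a fresh random nonce at a random location on every run, so a claimed read primitive can only pass by actually reading the target, not by replaying a value from an earlier run.

```python
import secrets


class ReadChallenge:
    """Hypothetical oracle for one 'arbitrary read' capability flag.

    Assumption: target memory is simulated as a bytearray. A real oracle
    would plant the nonce inside the target process instead.
    """

    def __init__(self, memory_size: int = 4096, nonce_len: int = 16):
        # Random filler so the nonce is not distinguishable from its surroundings.
        self.memory = bytearray(secrets.token_bytes(memory_size))
        # Fresh nonce and address on every run: old answers cannot be replayed.
        self.nonce = secrets.token_bytes(nonce_len)
        self.address = secrets.randbelow(memory_size - nonce_len)
        self.memory[self.address:self.address + nonce_len] = self.nonce

    def verify(self, response: bytes) -> bool:
        # Constant-time comparison of the claimed bytes against the nonce.
        return secrets.compare_digest(bytes(response), self.nonce)


def read_primitive(memory: bytearray, address: int, length: int) -> bytes:
    """Stand-in for an exploit's read primitive (hypothetical)."""
    return bytes(memory[address:address + length])


challenge = ReadChallenge()
leaked = read_primitive(challenge.memory, challenge.address, len(challenge.nonce))
print("working primitive passes:", challenge.verify(leaked))
print("empty response passes:", challenge.verify(b""))
```

Because the check depends only on the bytes returned, the verdict for a given run is deterministic, while the randomization across runs prevents a hard-coded answer from earning the flag.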
