SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks

Hwiwon Lee; Ziqi Zhang; Hanxiao Lu; Lingming Zhang

arXiv:2506.11791 · cs.LG · October 23, 2025

TL;DR

SEC-bench is an automated framework for evaluating LLM agents on real-world security tasks, revealing significant performance gaps and guiding future improvements in security engineering capabilities.

Contribution

We introduce SEC-bench, a novel automated benchmarking framework that constructs realistic security datasets and evaluates LLM agents on authentic security tasks.

Findings

1. LLM agents achieve up to 18% success in PoC generation.

2. LLM agents achieve up to 34% success in vulnerability patching.

3. SEC-bench reveals substantial performance gaps in current LLM security capabilities.

Abstract

Rigorous security-focused evaluation of large language model (LLM) agents is imperative for establishing trust in their safe deployment throughout the software development lifecycle. However, existing benchmarks largely rely on synthetic challenges or simplified vulnerability datasets that fail to capture the complexity and ambiguity encountered by security engineers in practice. We introduce SEC-bench, the first fully automated benchmarking framework for evaluating LLM agents on authentic security engineering tasks. SEC-bench employs a novel multi-agent scaffold that automatically constructs code repositories with harnesses, reproduces vulnerabilities in isolated environments, and generates gold patches for reliable evaluation. Our framework automatically creates high-quality software vulnerability datasets with reproducible artifacts at a cost of only $0.87 per instance. Using…
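The stages named in the abstract (building a repository with its harness, reproducing the vulnerability, and producing a gold patch) can be pictured as a simple acceptance gate. The sketch below is illustrative only: the class, field, and function names are hypothetical and are not taken from the SEC-bench code, and the cost helper assumes the quoted $0.87 scales linearly with the number of instances.

```python
from dataclasses import dataclass

# Per-instance cost quoted in the abstract (USD).
COST_PER_INSTANCE_USD = 0.87


@dataclass
class CandidateInstance:
    """Hypothetical record for one vulnerability candidate.

    Field names are assumptions made for illustration.
    """
    name: str
    builds_with_harness: bool  # repository builds together with its harness
    poc_triggers_bug: bool     # vulnerability reproduces in an isolated environment
    patch_stops_bug: bool      # gold patch prevents the PoC from triggering


def is_verified(inst: CandidateInstance) -> bool:
    """Accept an instance only if every stage succeeded."""
    return (
        inst.builds_with_harness
        and inst.poc_triggers_bug
        and inst.patch_stops_bug
    )


def estimated_cost_usd(n_instances: int) -> float:
    """Rough dataset cost, assuming linear scaling of the quoted figure."""
    return round(n_instances * COST_PER_INSTANCE_USD, 2)


if __name__ == "__main__":
    candidates = [
        CandidateInstance("example-a", True, True, True),
        CandidateInstance("example-b", True, True, False),
    ]
    kept = [c.name for c in candidates if is_verified(c)]
    print(kept)
    print(estimated_cost_usd(100))
```

Requiring all three checks to pass mirrors the abstract's emphasis on reproducible artifacts: a candidate whose patch does not stop the reproduced bug would not give a reliable evaluation target.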

Code & Models

Repositories

sec-bench/sec-bench · Official

Videos

SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks · SlidesLive

Taxonomy

Topics: Business Process Modeling and Analysis · Advanced Malware Detection Techniques · Scientific Computing and Data Management

Methods: Activation Patching