SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks
Hwiwon Lee, Ziqi Zhang, Hanxiao Lu, Lingming Zhang

TL;DR
SEC-bench is an automated framework for evaluating LLM agents on real-world security tasks, revealing significant performance gaps and guiding future improvements in security engineering capabilities.
Contribution
We introduce SEC-bench, a novel automated benchmarking framework that constructs realistic security datasets and evaluates LLM agents on authentic security tasks.
Findings
LLM agents achieve a success rate of at most 18% in proof-of-concept (PoC) generation.
LLM agents achieve a success rate of at most 34% in vulnerability patching.
SEC-bench reveals substantial performance gaps in current LLM security capabilities.
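The percentages above are most naturally read as the share of benchmark instances an agent solves. The summary does not spell out the exact scoring rule, so the following is only a minimal sketch under that assumption; the function name `success_rate` and the sample outcomes are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage of task instances marked as solved.

    Assumption: each instance has a single pass/fail outcome, and the
    reported figure is solved instances divided by all instances.
    """
    if not outcomes:
        raise ValueError("at least one outcome is required")
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical outcomes for five instances (illustrative only, not paper data).
example_outcomes = [True, False, False, True, False]
print(f"{success_rate(example_outcomes):.1f}%")  # prints "40.0%"
```

Under this reading, an 18% rate would correspond to 18 solved instances out of every 100 attempted.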
Abstract
Rigorous security-focused evaluation of large language model (LLM) agents is imperative for establishing trust in their safe deployment throughout the software development lifecycle. However, existing benchmarks largely rely on synthetic challenges or simplified vulnerability datasets that fail to capture the complexity and ambiguity encountered by security engineers in practice. We introduce SEC-bench, the first fully automated benchmarking framework for evaluating LLM agents on authentic security engineering tasks. SEC-bench employs a novel multi-agent scaffold that automatically constructs code repositories with harnesses, reproduces vulnerabilities in isolated environments, and generates gold patches for reliable evaluation. Our framework automatically creates high-quality software vulnerability datasets with reproducible artifacts at a cost of only $0.87 per instance. Using…
Taxonomy
Topics: Business Process Modeling and Analysis · Advanced Malware Detection Techniques · Scientific Computing and Data Management
Methods: Activation Patching
