Systematic Capability Benchmarking of Frontier Large Language Models for Offensive Cyber Tasks

Tyler H. Merves; Michael H. Conaway; Joseph M. Escobar; Hakan T. Otal; Unal Tatar

arXiv:2604.17159 · cs.CR · April 21, 2026

TL;DR

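This paper benchmarks 10 frontier large language models on the 200 challenges of the NYU CTF Bench and finds that environment tooling and model choice are the main performance factors, while some prompting strategies (auto-prompting, category-specific tips) often hurt performance in well-equipped environments.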

Contribution

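The paper presents what the authors describe as, to their knowledge, the most comprehensive cross-model evaluation of LLM agents on offensive cybersecurity tasks. It extends the D-CIPHER multi-agent framework with multi-provider backend support, a custom Kali Linux environment, and runtime tool-discovery agents.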

Findings

01

The custom Kali Linux environment improves the solve rate by 9.5 percentage points over Ubuntu.

02

Claude 4.5 Opus achieves the highest solve rate at 59%, followed by Gemini 3 Pro at 52%.

03

Environment tooling and model selection are primary performance drivers.

Abstract

We present, to our knowledge, the most comprehensive cross-model evaluation of LLM agents on offensive cybersecurity tasks, benchmarking 10 frontier models from 7 providers on all 200 challenges of the NYU CTF Bench. Building on the D-CIPHER multi-agent framework, we extend it with multi-provider backend support, a custom Kali Linux environment with over 100 pre-installed penetration testing tools, and runtime tool-discovery agents. Through a controlled factorial study, we find that the Kali Linux environment yields a +9.5 percentage-point improvement over Ubuntu, while auto-prompting and category-specific tips often degrade performance in well-equipped environments. Among models, Claude 4.5 Opus achieves the highest solve rate (59%), followed by Gemini 3 Pro (52%), with Gemini 3 Flash offering the best cost-efficiency at $0.05 per solve. Asymmetric planner/executor model assignments…
