Systematic Capability Benchmarking of Frontier Large Language Models for Offensive Cyber Tasks
Tyler H. Merves, Michael H. Conaway, Joseph M. Escobar, Hakan T. Otal, Unal Tatar

TL;DR
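This paper benchmarks 10 frontier large language models from 7 providers on the 200 challenges of the NYU CTF Bench. Environment tooling and model choice emerge as the main performance drivers, while some prompting strategies (auto-prompting and category-specific tips) often degrade performance in well-equipped environments.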
Contribution
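It provides what the authors describe as, to their knowledge, the most comprehensive cross-model evaluation of LLM agents on offensive cybersecurity tasks. It extends the D-CIPHER multi-agent framework with multi-provider backend support, a custom Kali Linux environment, and runtime tool-discovery agents.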
Findings
The Kali Linux environment improves the solve rate by 9.5 percentage points over Ubuntu.
Claude 4.5 Opus achieves the highest solve rate at 59%.
Environment tooling and model selection are primary performance drivers.
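The findings above rest on three simple metrics: solve rate, percentage-point difference between two configurations, and cost per solve. The following is a minimal sketch of how such metrics are typically computed. It is not the authors' evaluation code; the function names are hypothetical and the numbers in the usage lines are made up for illustration, not taken from the paper.

```python
def solve_rate(solved: int, total: int) -> float:
    """Return the share of challenges solved, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * solved / total


def percentage_point_gain(rate_a: float, rate_b: float) -> float:
    """Absolute difference between two rates already expressed in percent.

    This is a percentage-point difference, not a relative improvement.
    """
    return rate_a - rate_b


def cost_per_solve(total_cost_usd: float, solved: int) -> float:
    """Total spend divided by solved challenges (infinite if none solved)."""
    if solved == 0:
        return float("inf")
    return total_cost_usd / solved


# Illustrative, made-up inputs (NOT results from the paper).
rate_env_a = solve_rate(30, 200)   # 15.0
rate_env_b = solve_rate(20, 200)   # 10.0
gain = percentage_point_gain(rate_env_a, rate_env_b)  # 5.0 percentage points
usd_per_solve = cost_per_solve(6.0, 30)
```

Note that a gain of 5 percentage points from 10% to 15% is a 50% relative improvement, which is why the two are worth reporting separately.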
Abstract
We present, to our knowledge, the most comprehensive cross-model evaluation of LLM agents on offensive cybersecurity tasks, benchmarking 10 frontier models from 7 providers on all 200 challenges of the NYU CTF Bench. Building on the D-CIPHER multi-agent framework, we extend it with multi-provider backend support, a custom Kali Linux environment with over 100 pre-installed penetration testing tools, and runtime tool-discovery agents. Through a controlled factorial study, we find that the Kali Linux environment yields a +9.5 percentage-point improvement over Ubuntu, while auto-prompting and category-specific tips often degrade performance in well-equipped environments. Among models, Claude 4.5 Opus achieves the highest solve rate (59%), followed by Gemini 3 Pro (52%), with Gemini 3 Flash offering the best cost-efficiency at $0.05 per solve. Asymmetric planner/executor model assignments…
