DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments

Chiyu Zhang; Marc-Alexandre Cote; Michael Albada; Anush Sankaran; Jack W. Stokes; Tong Wang; Amir Abdi; William Blum; Muhammad Abdul-Mageed

arXiv:2506.00739 · cs.CL · October 15, 2025

TL;DR

DefenderBench is an open-source toolkit for evaluating language agents on cybersecurity tasks. It provides a standardized framework for fair comparison and benchmarking of LLMs across multiple cybersecurity domains.

Contribution

The paper introduces an accessible toolkit for evaluating language agents in cybersecurity, with environments and benchmarks for diverse tasks and a standardized assessment framework.

Findings

01 Claude-3.7-sonnet achieves the highest DefenderBench score, 81.65.

02 The open-weight Llama 3.3 70B scores 71.81.

03 Benchmarking reveals performance differences among models, for example a 9.84-point gap between Claude-3.7-sonnet and Llama 3.3 70B.

Abstract

Large language model (LLM) agents have shown impressive capabilities in human language comprehension and reasoning, yet their potential in cybersecurity remains underexplored. We introduce DefenderBench, a practical, open-source toolkit for evaluating language agents across offense, defense, and cybersecurity knowledge-based tasks. DefenderBench includes environments for network intrusion, malicious content detection, code vulnerability analysis, and cybersecurity knowledge assessment. It is intentionally designed to be affordable and easily accessible for researchers while providing fair and rigorous assessment. We benchmark several state-of-the-art (SoTA) and popular LLMs, including both open- and closed-weight models, using a standardized agentic framework. Our results show that Claude-3.7-sonnet performs best with a DefenderBench score of 81.65, followed by Claude-3.7-sonnet-think…

Code & Models

Repositories

microsoft/defenderbench (Official)

Videos

DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments · underline

Taxonomy

Topics: Natural Language Processing Techniques · Multi-Agent Systems and Negotiation · Network Security and Intrusion Detection

Methods: LLaMA