NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security

Minghao Shao; Sofija Jancheska; Meet Udeshi; Brendan Dolan-Gavitt; Haoran Xi; Kimberly Milner; Boyuan Chen; Max Yin; Siddharth Garg; Prashanth Krishnamurthy; Farshad Khorrami; Ramesh Karri; Muhammad Shafique

arXiv:2406.05590 · cs.CR · February 19, 2025 · 1 citation

TL;DR

This paper introduces a scalable, open-source benchmark dataset and automated framework for evaluating large language models in solving cybersecurity Capture the Flag challenges, facilitating research and development in AI-driven security solutions.

Contribution

The authors created a novel, open-source CTF benchmark dataset and an automated evaluation framework tailored for assessing LLMs in cybersecurity tasks, including support for external tool calls.

Findings

01. Evaluated five LLMs on CTF challenges with insights into their performance.

02. Provided an open-source platform for benchmarking LLMs in cybersecurity.

03. Demonstrated the potential of LLMs in automated vulnerability detection.

Abstract

Large Language Models (LLMs) are being deployed across various domains today. However, their capacity to solve Capture the Flag (CTF) challenges in cybersecurity has not been thoroughly evaluated. To address this, we develop a novel method to assess LLMs in solving CTF challenges by creating a scalable, open-source benchmark database specifically designed for these applications. This database includes metadata for LLM testing and adaptive learning, compiling a diverse range of CTF challenges from popular competitions. Utilizing the advanced function calling capabilities of LLMs, we build a fully automated system with an enhanced workflow and support for external tool calls. Our benchmark dataset and automated framework allow us to evaluate the performance of five LLMs, encompassing both black-box and open-source models. This work lays the foundation for future research into improving…

Code & Models

Videos

NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security · SlidesLive

Taxonomy

Topics: Network Security and Intrusion Detection · Information and Cyber Security