Vulnerability Mitigation System (VMS): LLM Agent and Evaluation Framework for Autonomous Penetration Testing

Farzana Abdulzada

arXiv:2507.21113·cs.CR·July 30, 2025

TL;DR

The paper introduces VMS, an autonomous LLM-based agent for penetration testing, along with new benchmarks, demonstrating effective cybersecurity testing with GPT-4o, while emphasizing safety and public availability.

Contribution

It presents a novel LLM-powered agent architecture for autonomous penetration testing and introduces standardized benchmarks for evaluating such systems.

Findings

01

GPT-4o outperformed other LLMs in tests.

02

VMS effectively automates penetration testing tasks.

03

Benchmarks enable standardized evaluation of cybersecurity agents.

Abstract

As the frequency of cyber threats increases, conventional penetration testing is failing to capture the entirety of today's complex environments. To address this problem, we propose the Vulnerability Mitigation System (VMS), a novel agent based on a Large Language Model (LLM) capable of performing penetration testing without human intervention. The VMS has a two-part architecture for planning and a Summarizer, which together enable it to generate commands and process feedback. To standardize testing, we designed two new Capture the Flag (CTF) benchmarks based on the PicoCTF and OverTheWire platforms, comprising 200 challenges. These benchmarks allow us to evaluate how effectively the system functions. We performed a number of experiments using various LLMs while tuning the temperature and top-p parameters and found that GPT-4o performed best, sometimes even better than expected. The results indicate that…
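
To make the plan, execute, and summarize cycle described in the abstract more concrete, here is a minimal sketch of one possible reading of it. It is not the paper's implementation: the names `run_agent`, `AgentState`, `stub_planner`, `stub_summarizer`, and `fake_execute` are all hypothetical, the prompt formats are invented for illustration, and the "sandbox" is a lookup table, so no real command is ever executed. In a real system the planner and summarizer callables would presumably wrap an LLM API call with the chosen temperature and top-p settings.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical type: any function that maps a prompt string to a completion.
TextFn = Callable[[str], str]


@dataclass
class AgentState:
    task: str
    summary: str = ""                                  # condensed feedback so far
    history: List[str] = field(default_factory=list)   # commands issued


def run_agent(task: str, planner: TextFn, summarizer: TextFn,
              execute: TextFn, max_steps: int = 5) -> AgentState:
    """Alternate between planning a command and summarizing its output."""
    state = AgentState(task=task)
    for _ in range(max_steps):
        # 1. The planner proposes the next command from the task and summary.
        command = planner(
            f"Task: {state.task}\nSummary so far: {state.summary}\nNext command:"
        )
        # 2. The command runs in some environment (here, a fake one).
        output = execute(command)
        state.history.append(command)
        # 3. The summarizer condenses the feedback for the next planning step.
        state.summary = summarizer(
            f"Previous summary: {state.summary}\n"
            f"Command: {command}\nOutput: {output}\nUpdated summary:"
        )
        # Assumed stop condition: a CTF-style flag appears in the output.
        if "FLAG{" in output:
            break
    return state


# Stubs standing in for LLM calls and a sandbox (assumptions, for demo only).
def stub_planner(prompt: str) -> str:
    return "cat flag.txt" if "flag.txt" in prompt else "ls"


def stub_summarizer(prompt: str) -> str:
    for line in prompt.splitlines():
        if line.startswith("Output: "):
            return "Last output was: " + line[len("Output: "):]
    return ""


def fake_execute(command: str) -> str:
    canned = {"ls": "flag.txt", "cat flag.txt": "FLAG{example}"}
    return canned.get(command, "command not found")


if __name__ == "__main__":
    final = run_agent("Find the flag", stub_planner, stub_summarizer, fake_execute)
    print(final.history)
    print(final.summary)
```

Keeping the planner, summarizer, and executor as plain callables means different models or sampling settings can be swapped in without changing the loop, which is the kind of comparison the abstract describes.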
