Towards Supporting Penetration Testing Education with Large Language Models: an Evaluation and Comparison
Martin Nizon-Deladoeuille, Brynjólfur Stefánsson, Helmut Neukirchen, Thomas Welsh

TL;DR
This paper evaluates how well six large language models support penetration testing education, using fifteen representative tasks run against the Metasploitable v3 Ubuntu image and OWASP WebGOAT.
Contribution
It provides a comparative analysis of multiple LLMs' capabilities in cybersecurity education, highlighting GPT-4o mini's consistency and the potential of WhiteRabbitNeo.
Findings
GPT-4o mini offers the most consistent support for penetration testing tasks.
WhiteRabbitNeo's innovative approach to tool and command recommendations makes it worth considering alongside GPT-4o mini.
Further research is needed to optimize LLMs for complex cybersecurity tasks.
Abstract
Cybersecurity education is challenging, and it is helpful for educators to understand Large Language Models' (LLMs') capabilities for supporting education. This study evaluates the effectiveness of LLMs in conducting a variety of penetration testing tasks. Fifteen representative tasks were selected to cover a comprehensive range of real-world scenarios. We evaluate the performance of six models (GPT-4o mini, GPT-4o, Gemini 1.5 Flash, Llama 3.1 405B, Mixtral 8x7B and WhiteRabbitNeo) on the Metasploitable v3 Ubuntu image and OWASP WebGOAT. Our findings suggest that GPT-4o mini currently offers the most consistent support, making it a valuable tool for educational purposes. However, its use in conjunction with WhiteRabbitNeo should be considered, because of the latter's innovative approach to tool and command recommendations. This study underscores the need for continued research into optimising LLMs…
