LLM Cyber Evaluations Don't Capture Real-World Risk

Kamilė Lukošiūtė; Adam Swanda

arXiv:2502.00072·cs.CR·February 4, 2025

TL;DR

This paper critiques current LLM cybersecurity risk evaluations, proposing a comprehensive framework that considers attacker behavior and impact potential, demonstrated through a case study on cybersecurity assistants.

Contribution

It introduces a new risk assessment framework for LLM cybersecurity capabilities and applies it to a case study, highlighting the importance of real-world impact analysis.

Findings

01. High compliance rates in models for cyber tasks
02. Moderate accuracy on realistic cybersecurity tasks
03. Moderate overall risk due to limited impact potential

Abstract

Large language models (LLMs) are demonstrating increasing prowess in cybersecurity applications, creating inherent risks alongside their potential for strengthening defenses. In this position paper, we argue that current efforts to evaluate risks posed by these capabilities are misaligned with the goal of understanding real-world impact. Evaluating LLM cybersecurity risk requires more than just measuring model capabilities -- it demands a comprehensive risk assessment that incorporates analysis of threat actor adoption behavior and potential for impact. We propose a risk assessment framework for LLM cyber capabilities and apply it to a case study of language models used as cybersecurity assistants. Our evaluation of frontier models reveals high compliance rates but moderate accuracy on realistic cyber assistance tasks. However, our framework suggests that this particular use…

Code & Models

Repositories

kamilelukosiute/yet-another-cybersec-assistance-eval (Official)

Taxonomy

Topics: Law, AI, and Intellectual Property

Methods: ALIGN