TL;DR
CritBench is a new framework for evaluating the cybersecurity skills of large language models specifically within IEC 61850 digital substation environments, addressing a gap in existing IT-focused assessments.
Contribution
The paper introduces CritBench, a domain-specific evaluation framework, and assesses five state-of-the-art models on 81 cybersecurity tasks in operational technology settings.
Findings
Models reliably perform static configuration analysis and network enumeration.
Performance drops on dynamic tasks requiring live system interaction.
Tool scaffolds improve operational capabilities of models.
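To make "static configuration analysis" concrete, here is a minimal sketch of the kind of read-only inspection such a task might involve. IEC 61850 substations are described in SCL (Substation Configuration Language) XML files, so enumerating devices and their logical nodes can be done offline. The sample fragment, the IED names, and the helper `list_logical_nodes` are hypothetical illustrations. They are not taken from the CritBench task corpus or its tool scaffold, and the fragment is heavily trimmed rather than schema-valid.

```python
import xml.etree.ElementTree as ET

# XML namespace used by SCL files (IEC 61850-6).
SCL_NS = {"scl": "http://www.iec.ch/61850/2003/SCL"}

# Hypothetical, heavily trimmed SCL fragment for illustration only.
SAMPLE_SCL = """
<SCL xmlns="http://www.iec.ch/61850/2003/SCL">
  <IED name="IED_A" manufacturer="ExampleVendor">
    <AccessPoint name="AP1">
      <Server>
        <LDevice inst="LD0">
          <LN0 lnClass="LLN0" inst="" lnType="LLN0_T"/>
          <LN lnClass="XCBR" inst="1" lnType="XCBR_T"/>
        </LDevice>
      </Server>
    </AccessPoint>
  </IED>
  <IED name="IED_B" manufacturer="ExampleVendor">
    <AccessPoint name="AP1">
      <Server>
        <LDevice inst="LD0">
          <LN0 lnClass="LLN0" inst="" lnType="LLN0_T"/>
          <LN lnClass="MMXU" inst="1" lnType="MMXU_T"/>
        </LDevice>
      </Server>
    </AccessPoint>
  </IED>
</SCL>
"""


def list_logical_nodes(scl_text):
    """Map each IED name to the logical node classes it declares."""
    root = ET.fromstring(scl_text)
    result = {}
    for ied in root.findall("scl:IED", SCL_NS):
        classes = []
        # LN0 is the mandatory logical node zero; LN covers all other nodes.
        for tag in ("LN0", "LN"):
            for node in ied.findall(f".//scl:{tag}", SCL_NS):
                classes.append(node.get("lnClass"))
        result[ied.get("name")] = classes
    return result


if __name__ == "__main__":
    for ied_name, classes in list_logical_nodes(SAMPLE_SCL).items():
        print(ied_name, classes)
```

A logical node class such as `XCBR` (circuit breaker) or `MMXU` (measurement) indicates what a device controls or measures, which is the sort of fact that can be recovered from a configuration file without touching a live system.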
Abstract
The advancement of Large Language Models (LLMs) has raised concerns regarding their dual-use potential in cybersecurity. Existing evaluation frameworks overwhelmingly focus on Information Technology (IT) environments, failing to capture the constraints and specialized protocols of Operational Technology (OT). To address this gap, we introduce CritBench, a novel framework designed to evaluate the cybersecurity capabilities of LLM agents within IEC 61850 Digital Substation environments. We assess five state-of-the-art models, including OpenAI's GPT-5 suite and open-weight models, across a corpus of 81 domain-specific tasks spanning static configuration analysis, network traffic reconnaissance, and live virtual machine interaction. To facilitate industrial protocol interaction, we develop a domain-specific tool scaffold. Our empirical results show that agents reliably execute static…
