Evaluating Jailbreaking Vulnerabilities in LLMs Deployed as Assistants for Smart Grid Operations: A Benchmark Against NERC Standards

Taha Hammadia, Lucas Rea, Ahmad Mohammad Saber, Amr Youssef, and Deepa Kundur

arXiv:2604.23341 · cs.CR · May 1, 2026

TL;DR

This study assesses jailbreaking risks in LLMs used for smart grid operations, revealing vulnerabilities to malicious prompts that could violate regulatory standards, with varying success across different models and attack methods.

Contribution

It provides a benchmark evaluating the susceptibility of leading LLMs to jailbreaking attacks in a critical infrastructure context, highlighting model-specific vulnerabilities and the impact of prompt refinement.

Findings

1. DeepInception achieved the highest attack success rate at 63.17%.

2. Claude 3.5 Haiku showed complete resistance to jailbreaking attempts.

3. Refined prompts increased attack success rate to 30.6% for simpler methods.

Abstract

The deployment of Large Language Models (LLMs) as assistants in electric grid operations promises to streamline compliance and decision-making but exposes new vulnerabilities to prompt-based adversarial attacks. This paper evaluates the risk of jailbreaking LLMs, i.e., circumventing safety alignments to produce outputs violating regulatory standards, assuming threats from authorized users, such as operators, who craft malicious prompts to elicit non-compliant guidance. Three state-of-the-art LLMs (OpenAI's GPT-4o mini, Google's Gemini 2.0 Flash-Lite, and Anthropic's Claude 3.5 Haiku) were tested against Baseline, BitBypass, and DeepInception jailbreaking methods across scenarios derived from nine NERC Reliability Standards (EOP, TOP, and CIP). In the initial broad experiment, the overall Attack Success Rate (ASR) was 33.1%, with DeepInception proving most effective at 63.17% ASR. Claude…
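The abstract reports results as an Attack Success Rate (ASR) but the text available here does not spell out how it is computed. A minimal sketch, assuming ASR is the percentage of attack attempts judged successful, is shown below. The function name `attack_success_rate`, the `(method, succeeded)` record layout, and the toy outcomes are all hypothetical and are not taken from the paper; the numbers are illustrative only and do not reproduce its results.

```python
from collections import defaultdict


def attack_success_rate(trials):
    """Return ASR in percent per attack method.

    `trials` is an iterable of (method, succeeded) pairs, where
    `succeeded` is True when the attempt was judged a successful jailbreak.
    This record layout is an assumption made for illustration.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for method, succeeded in trials:
        totals[method] += 1
        if succeeded:
            successes[method] += 1
    return {m: 100.0 * successes[m] / totals[m] for m in totals}


# Hypothetical toy outcomes, NOT the paper's data.
toy_trials = [
    ("Baseline", False), ("Baseline", False),
    ("Baseline", True), ("Baseline", False),
    ("DeepInception", True), ("DeepInception", True),
    ("DeepInception", False), ("DeepInception", True),
]

asr = attack_success_rate(toy_trials)

# Overall ASR pools every attempt regardless of method.
overall = 100.0 * sum(ok for _, ok in toy_trials) / len(toy_trials)

print(asr)
print(overall)
```

Pooling all attempts for the overall figure weights each method by its number of trials; averaging the per-method rates instead would give a different number whenever the methods have unequal trial counts.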
