Expected Harm: Rethinking Safety Evaluation of (Mis)Aligned LLMs

Yen-Shan Chen; Zhi Rui Tam; Cheng-Kuang Wu; Yun-Nung Chen

arXiv:2602.01600·cs.CR·February 3, 2026


Open Access

TL;DR

This paper introduces Expected Harm, a safety evaluation metric for LLMs that weights the severity of a malicious query by its execution likelihood, and uses it to reveal systematic risk miscalibration in current models.

Contribution

The paper proposes the Expected Harm metric, analyzes how miscalibrated risk leaves models vulnerable, and finds that LLMs lack an internal representation of execution cost.

Findings

01

Models show inverse risk calibration: they refuse low-likelihood (high-cost) threats more strongly while remaining vulnerable to high-likelihood (low-cost) ones.

02

Exploiting calibration issues can double jailbreak success rates.

03

Models encode severity but lack internal understanding of execution cost.
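
The inverse risk calibration described in finding 01 can be illustrated with a simple bucketed comparison of refusal rates. This is only a sketch: the source does not describe how the paper measures the effect, so the likelihood threshold, the record layout, and the function names below are all hypothetical, and the records are made-up toy values rather than results.

```python
def refusal_rate_by_bucket(records, threshold=0.5):
    """Return (low_likelihood_rate, high_likelihood_rate).

    `records` is a list of (execution_likelihood, refused) pairs.
    ASSUMPTION: splitting at a fixed threshold is an illustrative choice,
    not the paper's procedure.
    """
    low = [refused for likelihood, refused in records if likelihood < threshold]
    high = [refused for likelihood, refused in records if likelihood >= threshold]
    if not low or not high:
        raise ValueError("both buckets need at least one record")
    # Booleans sum as 0/1, so this is the fraction of refused queries.
    return sum(low) / len(low), sum(high) / len(high)


def is_inversely_calibrated(records, threshold=0.5):
    """True when low-likelihood queries are refused more often than high-likelihood ones."""
    low_rate, high_rate = refusal_rate_by_bucket(records, threshold)
    return low_rate > high_rate


# Toy data (hypothetical): (execution likelihood, model refused?)
toy_records = [(0.1, True), (0.2, True), (0.8, False), (0.9, True)]
print(refusal_rate_by_bucket(toy_records))
print(is_inversely_calibrated(toy_records))
```

A well-calibrated model would show the opposite ordering, refusing the high-likelihood bucket at least as often as the low-likelihood one.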

Abstract

Current evaluations of LLM safety predominantly rely on severity-based taxonomies to assess the harmfulness of malicious queries. We argue that this formulation requires re-examination as it assumes uniform risk across all malicious queries, neglecting Execution Likelihood--the conditional probability of a threat being realized given the model's response. In this work, we introduce Expected Harm, a metric that weights the severity of a jailbreak by its execution likelihood, modeled as a function of execution cost. Through empirical analysis of state-of-the-art models, we reveal a systematic Inverse Risk Calibration: models disproportionately exhibit stronger refusal behaviors for low-likelihood (high-cost) threats while remaining vulnerable to high-likelihood (low-cost) queries. We demonstrate that this miscalibration creates a structural vulnerability: by exploiting this property, we…
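
To make the definition concrete, here is a minimal sketch of how an expected-harm score could be computed. The abstract states only that execution likelihood is modeled as a function of execution cost; the exponential decay used below, the `decay` parameter, the severity scale, and all function names are illustrative assumptions, not the paper's formulation.

```python
import math


def execution_likelihood(cost: float, decay: float = 1.0) -> float:
    """Map a non-negative execution cost to a likelihood in (0, 1].

    ASSUMPTION: exponential decay is a placeholder functional form;
    the source does not specify how cost maps to likelihood.
    """
    if cost < 0:
        raise ValueError("cost must be non-negative")
    return math.exp(-decay * cost)


def expected_harm(severity: float, cost: float, decay: float = 1.0) -> float:
    """Severity weighted by the (assumed) execution likelihood."""
    return severity * execution_likelihood(cost, decay)


# Hypothetical queries: (label, severity, execution cost).
queries = [
    ("high severity, high cost", 10.0, 5.0),
    ("moderate severity, low cost", 4.0, 0.5),
]
for label, severity, cost in queries:
    print(f"{label}: {expected_harm(severity, cost):.3f}")
```

Under these made-up numbers, the low-cost query scores higher than the high-cost one despite its lower severity, which is the kind of ranking a severity-only taxonomy would miss.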


Taxonomy

Topics: Security and Verification in Computing · Adversarial Robustness in Machine Learning · Network Security and Intrusion Detection