Expected Harm: Rethinking Safety Evaluation of (Mis)Aligned LLMs
Yen-Shan Chen, Zhi Rui Tam, Cheng-Kuang Wu, Yun-Nung Chen

TL;DR
This paper introduces Expected Harm, a new safety evaluation metric for LLMs that considers both severity and execution likelihood of malicious queries, revealing systematic vulnerabilities and miscalibrations in current models.
Contribution
It proposes the Expected Harm metric, analyzes model vulnerabilities related to risk calibration, and uncovers the lack of internal representation of execution cost in LLMs.
Findings
Models show inverse risk calibration: they refuse low-likelihood (high-cost) threats more strongly while remaining vulnerable to high-likelihood (low-cost) ones.
Exploiting this miscalibration can double jailbreak success rates.
Models encode severity but lack internal understanding of execution cost.
Abstract
Current evaluations of LLM safety predominantly rely on severity-based taxonomies to assess the harmfulness of malicious queries. We argue that this formulation requires re-examination as it assumes uniform risk across all malicious queries, neglecting Execution Likelihood: the conditional probability of a threat being realized given the model's response. In this work, we introduce Expected Harm, a metric that weights the severity of a jailbreak by its execution likelihood, modeled as a function of execution cost. Through empirical analysis of state-of-the-art models, we reveal a systematic Inverse Risk Calibration: models disproportionately exhibit stronger refusal behaviors for low-likelihood (high-cost) threats while remaining vulnerable to high-likelihood (low-cost) queries. We demonstrate that this miscalibration creates a structural vulnerability: by exploiting this property, we…
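To make the definition concrete, the following is a minimal sketch of a severity-times-likelihood score. The abstract states only that execution likelihood is modeled as a function of execution cost; the exponential decay used here, the `decay` parameter, the function names, and the numeric scales are all illustrative assumptions, not the authors' formulation.

```python
import math


def execution_likelihood(cost: float, decay: float = 1.0) -> float:
    """Map an execution cost to a likelihood in (0, 1].

    ASSUMPTION: exponential decay in cost. The paper's actual functional
    form is not given in the abstract; this is a placeholder.
    """
    if cost < 0:
        raise ValueError("cost must be non-negative")
    return math.exp(-decay * cost)


def expected_harm(severity: float, cost: float, decay: float = 1.0) -> float:
    """Severity weighted by the (assumed) execution likelihood."""
    return severity * execution_likelihood(cost, decay)


# Hypothetical queries on made-up scales: one severe but costly to carry
# out, one moderate but cheap to carry out.
high_severity_high_cost = expected_harm(severity=10.0, cost=5.0)
moderate_severity_low_cost = expected_harm(severity=4.0, cost=0.5)

print(high_severity_high_cost < moderate_severity_low_cost)
```

Under these assumed numbers the cheaper, moderately severe query carries the larger expected harm, which is the kind of case a severity-only taxonomy would rank the other way.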
Taxonomy
Topics: Security and Verification in Computing · Adversarial Robustness in Machine Learning · Network Security and Intrusion Detection
