From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model

Mengya Hu; Qiong Wei; Sandeep Atluri

arXiv:2604.26052·cs.CL·May 21, 2026

TL;DR

This study analyzes whether large language model responses reduce, preserve, or escalate harm severity relative to the prompt across four harm categories, revealing nuanced safety behaviors and tradeoffs.

Contribution

It introduces a paired analysis method that compares prompt severity with response severity, uncovering mechanisms of harm escalation and relevance tradeoffs in LLM responses.

Findings

01

61% of responses reduce harm severity relative to the prompt

02

36% of responses preserve the prompt's harm severity

03

3% of responses escalate harm severity

Abstract

Safety evaluations of large language models (LLMs) typically report binary outcomes, such as attack success rate (ASR), refusal rate, or harmful-versus-safe classification, which hide how risk changes between prompt and response. We present a paired analysis over human-labeled prompt and response records across four harm categories (Sexual, Self-harm, Hate, and Violence) and ordinal severity levels (Safe, Low, Medium, High). 61% of responses reduce harm relative to the prompt, 36% preserve severity, and 3% escalate. The escalation splits into two mechanisms: benign prompts triggering unrequested harmful detail, and answers that stay on task at higher severity than the prompt. Category decomposition shows that Sexual content exhibits the highest harm persistence in this sample, driven by compliance at the same severity rather than drift from benign inputs. Joint relevance analysis exposes a…
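
The paired comparison described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the names `Severity`, `transition`, `escalation_mechanism`, and `transition_shares` are hypothetical, the sample records are invented toy data, and splitting the two escalation mechanisms by whether the prompt was labeled Safe is a simplifying assumption (the paper may use relevance labels or other criteria). The sketch also assumes that every pair with equal labels, including Safe to Safe, counts as "preserve".

```python
from collections import Counter
from enum import IntEnum


class Severity(IntEnum):
    """Ordinal severity levels named in the abstract."""
    SAFE = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def transition(prompt: Severity, response: Severity) -> str:
    """Classify one prompt/response pair by how severity changed."""
    if response < prompt:
        return "reduce"
    if response == prompt:
        return "preserve"
    return "escalate"


def escalation_mechanism(prompt: Severity, response: Severity):
    """Label an escalating pair; returns None when there is no escalation.

    Assumption: a Safe prompt stands in for "benign prompt"; any other
    escalation is treated as an on-task answer at higher severity.
    """
    if response <= prompt:
        return None
    if prompt == Severity.SAFE:
        return "benign_prompt"
    return "on_task_higher_severity"


def transition_shares(records):
    """Return the fraction of pairs in each transition class."""
    if not records:
        raise ValueError("records must be non-empty")
    counts = Counter(transition(p, r) for p, r in records)
    return {k: counts[k] / len(records) for k in ("reduce", "preserve", "escalate")}


# Hypothetical (prompt severity, response severity) pairs, not data from the paper.
records = [
    (Severity.HIGH, Severity.SAFE),
    (Severity.MEDIUM, Severity.MEDIUM),
    (Severity.SAFE, Severity.LOW),
    (Severity.LOW, Severity.HIGH),
    (Severity.HIGH, Severity.LOW),
]

shares = transition_shares(records)
```

Using an `IntEnum` keeps the labels ordinal, so "reduce", "preserve", and "escalate" reduce to plain comparisons; applying the same function per harm category would give a category decomposition of the kind the abstract mentions.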

Code & Models

Repositories

microsoft/PairedSafety (GitHub)
