Do Language Models Know When They'll Refuse? Probing Introspective Awareness of Safety Boundaries
Tanay Gondil

TL;DR
This study tests whether large language models can predict their own refusal behavior. Models show high introspective sensitivity overall, but sensitivity drops substantially at safety boundaries; confidence scores help identify which predictions are reliable enough for safety-critical routing.
Contribution
The paper systematically assesses models' introspective awareness of their safety boundaries, showing generational improvement within Claude (Sonnet 4 to Sonnet 4.5) and the usefulness of confidence scores for safety routing.
Findings
Models exhibit high introspective sensitivity (d' = 2.4-3.5).
Restricting to high-confidence predictions yields 98.3% accuracy.
Refusal prediction accuracy varies across models and topics, particularly on weapons-related queries.
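The confidence finding above can be illustrated with a small selective-accuracy calculation: keep only predictions whose confidence clears a threshold, then report accuracy on that subset along with the fraction of requests it covers. This is a minimal sketch, not the paper's code. The `Trial` record, the 0 to 1 confidence scale, the threshold value, and the example data are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Trial:
    """One request: the model's self-prediction and its actual behavior (hypothetical schema)."""
    predicted_refusal: bool
    confidence: float  # assumed to be self-reported on a 0..1 scale
    actual_refusal: bool


def selective_accuracy(trials: List[Trial], threshold: float) -> Tuple[Optional[float], float]:
    """Return (accuracy on trials with confidence >= threshold, fraction of trials kept)."""
    kept = [t for t in trials if t.confidence >= threshold]
    if not kept:
        # Nothing clears the threshold, so accuracy is undefined.
        return None, 0.0
    correct = sum(t.predicted_refusal == t.actual_refusal for t in kept)
    return correct / len(kept), len(kept) / len(trials)


# Made-up illustrative data, not results from the paper.
example_trials = [
    Trial(predicted_refusal=True, confidence=0.95, actual_refusal=True),
    Trial(predicted_refusal=False, confidence=0.92, actual_refusal=False),
    Trial(predicted_refusal=True, confidence=0.60, actual_refusal=False),
    Trial(predicted_refusal=False, confidence=0.55, actual_refusal=False),
]
accuracy, coverage = selective_accuracy(example_trials, threshold=0.9)
```

In a routing setup, predictions below the threshold would be sent to a fallback (for example, a separate classifier or human review), so the threshold trades coverage against accuracy.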
Abstract
Large language models are trained to refuse harmful requests, but can they accurately predict when they will refuse before responding? We investigate this question through a systematic study where models first predict their refusal behavior, then respond in a fresh context. Across 3754 datapoints spanning 300 requests, we evaluate four frontier models: Claude Sonnet 4, Claude Sonnet 4.5, GPT-5.2, and Llama 3.1 405B. Using signal detection theory (SDT), we find that all models exhibit high introspective sensitivity (d' = 2.4-3.5), but sensitivity drops substantially at safety boundaries. We observe generational improvement within Claude (Sonnet 4.5: 95.7 percent accuracy vs Sonnet 4: 93.0 percent), while GPT-5.2 shows lower accuracy (88.9 percent) with more variable behavior. Llama 405B achieves high sensitivity but exhibits strong refusal bias and poor calibration, resulting in lower…
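For readers unfamiliar with signal detection theory, the sensitivity index d' can be computed from a 2x2 table of predicted versus actual refusals. The sketch below treats "the model actually refused" as the signal, so a hit is a predicted refusal that did occur and a false alarm is a predicted refusal that did not. The log-linear correction (adding 0.5 to each cell) is an assumption made here to avoid infinite z-scores; the abstract does not say which correction, if any, the author used. The counts are made up.

```python
from statistics import NormalDist
from typing import Tuple


def sdt_measures(hits: int, misses: int,
                 false_alarms: int, correct_rejections: int) -> Tuple[float, float]:
    """Return (d_prime, criterion) for a refusal-prediction confusion table.

    hits:               predicted refuse, actually refused
    misses:             predicted comply, actually refused
    false_alarms:       predicted refuse, actually complied
    correct_rejections: predicted comply, actually complied
    """
    # Log-linear correction (an assumption of this sketch): keeps rates away from 0 and 1.
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)

    z = NormalDist().inv_cdf  # inverse CDF of the standard normal
    d_prime = z(hit_rate) - z(fa_rate)
    # Criterion c: negative values indicate a bias toward predicting "refuse".
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))
    return d_prime, criterion


# Hypothetical counts for illustration only.
d_prime, criterion = sdt_measures(hits=90, misses=10, false_alarms=10, correct_rejections=90)
```

Separating d' from the criterion is what lets a model show high sensitivity while still carrying a strong bias toward predicting refusal, the pattern the abstract describes for Llama 405B.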
