Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models

Baihui Zheng; Boren Zheng; Kerui Cao; Yingshui Tan; Zhendong Liu; Weixun Wang; Jiaheng Liu; Jian Yang; Wenbo Su; Xiaoyong Zhu; Bo Zheng; Kaifu Zhang

arXiv:2505.19690·cs.AI·May 27, 2025

TL;DR

This paper introduces a new benchmark called Beyond Safe Answers to evaluate whether large reasoning models truly understand and mitigate risks, revealing superficial safety issues and guiding improvements for safer AI systems.

Contribution

It presents a novel benchmark with challenging SSA scenarios, evaluates state-of-the-art models, and explores methods to improve genuine risk awareness in large reasoning models.

Findings

01

Top models achieve only 38% accuracy in risk rationale detection

02

Benchmark reveals superficial safety alignment issues in LRMs

03

Safety fine-tuning and decoding strategies show potential improvements

Abstract

Despite the remarkable proficiency of Large Reasoning Models (LRMs) in handling complex reasoning tasks, their reliability in safety-critical scenarios remains uncertain. Existing evaluations primarily assess response-level safety, neglecting a critical issue we identify as Superficial Safety Alignment (SSA): a phenomenon where models produce superficially safe outputs while internal reasoning processes fail to genuinely detect and mitigate underlying risks, resulting in inconsistent safety behaviors across multiple sampling attempts. To systematically investigate SSA, we introduce Beyond Safe Answers (BSA) bench, a novel benchmark comprising 2,000 challenging instances organized into three distinct SSA scenario types and spanning nine risk categories, each meticulously annotated with risk rationales. Evaluations of 19 state-of-the-art LRMs…
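
The abstract's point about inconsistent safety behavior across multiple sampling attempts can be made concrete with a small sketch. It assumes each benchmark instance has been sampled k times and that every sample already carries two judge labels: whether the response is safe, and whether the reasoning identifies the annotated risk. The function name `all_k_rates`, the label layout, and both rate definitions are illustrative assumptions, not the paper's implementation; the reviews below mention metrics called Safe@k and Think@k, but their exact definitions are not given on this page.

```python
from typing import Dict, List, Tuple

# Each sample is a pair (response_is_safe, reasoning_identifies_risk).
# Both labels are assumed to come from an external judge (hypothetical here).
Sample = Tuple[bool, bool]


def all_k_rates(samples_by_instance: Dict[str, List[Sample]], k: int) -> Tuple[float, float]:
    """Return (safe_rate, think_rate) using the first k samples of each instance.

    Assumed definitions (illustrative, not taken from the paper):
      safe_rate  = share of instances whose k responses are all judged safe
      think_rate = share of instances whose k samples are all safe AND whose
                   reasoning identifies the annotated risk every time
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not samples_by_instance:
        raise ValueError("no instances provided")

    safe_hits = 0
    think_hits = 0
    for instance_id, samples in samples_by_instance.items():
        if len(samples) < k:
            raise ValueError(f"instance {instance_id!r} has fewer than {k} samples")
        first_k = samples[:k]
        # An instance counts only if every one of its k samples passes.
        if all(safe for safe, _ in first_k):
            safe_hits += 1
        if all(safe and aware for safe, aware in first_k):
            think_hits += 1

    n = len(samples_by_instance)
    return safe_hits / n, think_hits / n


# Toy labels for four instances, three samples each (made-up data).
toy = {
    "q1": [(True, True), (True, True), (True, True)],
    "q2": [(True, False), (True, True), (True, False)],  # safe answers, risk often missed
    "q3": [(True, True), (False, False), (True, True)],  # unsafe on one attempt
    "q4": [(True, True), (True, True), (True, True)],
}

safe_rate, think_rate = all_k_rates(toy, k=3)
print(safe_rate, think_rate)  # prints: 0.75 0.5
```

In this toy data, instance `q2` is the superficial case: all three answers are judged safe, yet the reasoning misses the risk on two attempts, so it counts toward the response-level rate but not the reasoning-level one.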

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The investigated phenomenon is very important. While some reasoning models do not open-source their thinking process (like o1), others (like r1) expose the full thinking content to the user. This makes Superficial Safety Alignment (SSA) much more serious.
2. The authors systematically summarize this phenomenon and provide a complete framework to evaluate it, providing a benchmark for future improvements to this issue.
3. The evaluation metric fully considers the cost of human resource eval

Weaknesses

1. The presentation could be improved. Lines 127-132 contain blank space that makes the page appear somewhat empty.
2. How many GPU hours, total tokens, and dollar cost does one evaluation pipeline consume?

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The proposed formulation of Superficial Safety Alignment (SSA) is interesting and highlights an underexplored aspect of LRM safety. This perspective has clear practical relevance, as it draws attention to reasoning-level safety failures that are easily overlooked in standard output-based evaluations.
- The evaluation protocol is well-structured and methodologically sound. It provides a clear operationalization of SSA through metrics such as Safe@k and Think@k, enabling systematic diagnosis of

Weaknesses

Limited contribution and narrow scope

While the identified problem of Superficial Safety Alignment (SSA) is interesting and relevant, it only addresses a limited subset of safety risks in LRMs: specifically, cases where the reasoning is unsafe but the response appears safe. However, this represents only one aspect of the broader safety landscape in LRMs. Moreover, the idea of evaluating the quality of reasoning chains has already been extensively explored in recent literature. The proposed w

Reviewer 03 · Rating 2 · Confidence 5

Strengths

The authors do a good job of motivating safety with respect to these models and the example in Figure 1 was clear. The hybrid annotation approach was nicely validated and made the LLM-as-judge approach more likely to yield high quality data.

Weaknesses

Cognitive Shortcut is a term that has already been used extensively in the literature. I recommend the authors choose a new term that will be less confusing and not pollute the literature. The alarmist language needs to be toned down for an academic publication: “alarming extent”, “crucial insights”, “vital tool”. More detail is needed in related work rather than just vague mentions of prior work, e.g., “Anthropic (10) showed Claude 3 Opus varies behaviors under evaluation”. The paper is hard to read

Code & Models

Repositories

openstellarteam/bsa
Official

Taxonomy

Topics: Semantic Web and Ontologies · Explainable Artificial Intelligence (XAI) · Data Quality and Management