Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models
Baihui Zheng, Boren Zheng, Kerui Cao, Yingshui Tan, Zhendong Liu, Weixun Wang, Jiaheng Liu, Jian Yang, Wenbo Su, Xiaoyong Zhu, Bo Zheng, Kaifu Zhang

TL;DR
This paper introduces a new benchmark called Beyond Safe Answers to evaluate whether large reasoning models truly understand and mitigate risks, revealing superficial safety issues and guiding improvements for safer AI systems.
Contribution
It presents a novel benchmark of challenging Superficial Safety Alignment (SSA) scenarios, evaluates state-of-the-art models on it, and explores methods to improve genuine risk awareness in large reasoning models.
Findings
Top models achieve only 38% accuracy in risk rationale detection
Benchmark reveals superficial safety alignment issues in LRMs
Safety fine-tuning and decoding strategies show potential improvements
Abstract
Despite the remarkable proficiency of Large Reasoning Models (LRMs) in handling complex reasoning tasks, their reliability in safety-critical scenarios remains uncertain. Existing evaluations primarily assess response-level safety, neglecting a critical issue we identify as Superficial Safety Alignment (SSA): a phenomenon where models produce superficially safe outputs while internal reasoning processes fail to genuinely detect and mitigate underlying risks, resulting in inconsistent safety behaviors across multiple sampling attempts. To systematically investigate SSA, we introduce the Beyond Safe Answers (BSA) bench, a novel benchmark comprising 2,000 challenging instances organized into three distinct SSA scenario types and spanning nine risk categories, each meticulously annotated with risk rationales. Evaluations of 19 state-of-the-art LRMs…
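The inconsistency across sampling attempts described in the abstract can be illustrated with a minimal sketch. The reviews on this page mention the metrics Safe@k and Think@k, but their exact definitions do not appear here, so the code assumes that an instance counts only when all k sampled responses pass the check. The `Sample` fields and the function names are hypothetical and are not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    # Hypothetical judge outputs for one sampled response.
    safe_answer: bool      # the final response was judged safe
    risk_identified: bool  # the reasoning matched the annotated risk rationale


def safe_at_k(instances: List[List[Sample]]) -> float:
    """Assumed Safe@k: share of instances where all k responses are safe."""
    hits = sum(all(s.safe_answer for s in samples) for samples in instances)
    return hits / len(instances)


def think_at_k(instances: List[List[Sample]]) -> float:
    """Assumed Think@k: share of instances where all k reasoning traces
    identify the annotated risk."""
    hits = sum(all(s.risk_identified for s in samples) for samples in instances)
    return hits / len(instances)


def superficial_rate(instances: List[List[Sample]]) -> float:
    """Share of instances that are always safe in the answer but where at
    least one reasoning trace misses the risk (one way to quantify SSA)."""
    hits = sum(
        all(s.safe_answer for s in samples)
        and not all(s.risk_identified for s in samples)
        for samples in instances
    )
    return hits / len(instances)
```

Under these assumptions, a large gap between the Safe@k and Think@k values would suggest that safe-looking answers are not backed by reasoning that recognizes the risk.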
Peer Reviews
Decision·Submitted to ICLR 2026
1. The investigated phenomenon is very important. While some reasoning models do not expose their thinking process (like o1), others (like r1) show the full thinking content to the user. This makes Superficial Safety Alignment (SSA) much more serious. 2. The authors systematically summarize this phenomenon and provide a complete framework to evaluate it, offering a benchmark for future improvements on this issue. 3. The evaluation metric fully considers the cost of human resource eval
1. The presentation could be improved. Lines 127-132 contain blank space that makes the page appear somewhat empty. 2. How many GPU hours, total tokens, and dollars does one run of the evaluation pipeline consume?
- The proposed formulation of Superficial Safety Alignment (SSA) is interesting and highlights an underexplored aspect of LRM safety. This perspective has clear practical relevance, as it draws attention to reasoning-level safety failures that are easily overlooked in standard output-based evaluations. - The evaluation protocol is well-structured and methodologically sound. It provides a clear operationalization of SSA through metrics such as Safe@k and Think@k, enabling systematic diagnosis of
Limited contribution and narrow scope. While the identified problem of Superficial Safety Alignment (SSA) is interesting and relevant, it only addresses a limited subset of safety risks in LRMs, specifically cases where the reasoning is unsafe but the response appears safe. However, this represents only one aspect of the broader safety landscape in LRMs. Moreover, the idea of evaluating the quality of reasoning chains has already been extensively explored in recent literature. The proposed w
The authors do a good job of motivating safety with respect to these models and the example in Figure 1 was clear. The hybrid annotation approach was nicely validated and made the LLM-as-judge approach more likely to yield high quality data.
"Cognitive Shortcut" is a term that has already been used extensively in the literature. I recommend the authors choose a new term that will be less confusing and will not pollute the literature. The alarmist language needs to be toned down for an academic publication: “alarming extent”, “crucial insights”, “vital tool”. More detail is needed in the related work section rather than vague mentions of prior work, e.g., “Anthropic (10) showed Claude 3 Opus varies behaviors under evaluation”. The paper is hard to read
Taxonomy
Topics: Semantic Web and Ontologies · Explainable Artificial Intelligence (XAI) · Data Quality and Management
