Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering

Xiaomin Li; Jianheng Hou; Zheyuan Deng; Zhiwei Zhang; Taoran Li; Binghang Lu; Bing Hu; Yunhan Zhao; Yuexing Hao

arXiv:2605.05678·cs.AI·May 8, 2026

TL;DR

This paper investigates safety risks that arise during the reasoning process of large reasoning models, not only in their final answers, and introduces an adaptive multi-principle steering method to mitigate unsafe outputs.

Contribution

The paper demonstrates that safety failures often occur in reasoning traces and proposes a white-box mitigation technique that improves safety without sacrificing accuracy.

Findings

01

Reasoning traces reveal additional safety risks beyond final answers.

02

Adaptive steering reduces unsafe counts by up to 40.8% in reasoning models.

03

Safety evaluation should consider the entire reasoning-answer trajectory.

Abstract

Large reasoning models (LRMs) increasingly expose chain-of-thought-like reasoning for transparency, verification, and deliberate problem solving. This creates a safety blind spot: harmful or policy-violating content may appear in reasoning traces even when final answers appear safe. We test whether final-answer safety is a sufficient proxy for the full reasoning-answer trajectory by scoring both stages under a unified twenty-principle safety rubric. Using prompts from seven public harmfulness and jailbreak sources, plus four out-of-distribution (OOD) sources, we evaluate 15 open-weight and API-based LRMs across 41K prompts per model. Reasoning traces consistently reveal additional safety risks beyond final answers, especially in high-severity stage-wise failures: leak cases, where unsafe reasoning precedes a safe-looking answer, and escape cases, where benign-looking reasoning precedes…
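The stage-wise view described in the abstract can be illustrated with a minimal sketch. It assumes each response has already been reduced to two boolean flags, one for the reasoning trace and one for the final answer; the paper's twenty-principle rubric and its scoring procedure are not reproduced here. The names `StageScores`, `stage_pattern`, and `summarize` are hypothetical. Only the leak case is labeled with the paper's term, because the abstract's definition of escape cases is truncated; the remaining combinations carry neutral labels.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class StageScores:
    """Hypothetical per-response flags; the real rubric is richer than two booleans."""
    reasoning_unsafe: bool
    answer_unsafe: bool


def stage_pattern(s: StageScores) -> str:
    """Map the two stage flags to a coarse trajectory label."""
    if s.reasoning_unsafe and not s.answer_unsafe:
        # Matches the abstract: unsafe reasoning precedes a safe-looking answer.
        return "leak"
    if s.answer_unsafe and not s.reasoning_unsafe:
        # Neutral label (assumption); not mapped to a term from the paper.
        return "answer_only_unsafe"
    if s.reasoning_unsafe and s.answer_unsafe:
        return "both_unsafe"
    return "both_safe"


def summarize(scores: list[StageScores]) -> Counter:
    """Count how many responses fall under each trajectory label."""
    return Counter(stage_pattern(s) for s in scores)


# Made-up flags for illustration only; these are not results from the paper.
example = [
    StageScores(reasoning_unsafe=True, answer_unsafe=False),
    StageScores(reasoning_unsafe=False, answer_unsafe=False),
    StageScores(reasoning_unsafe=True, answer_unsafe=True),
    StageScores(reasoning_unsafe=False, answer_unsafe=True),
]
counts = summarize(example)
```

In this sketch, an evaluation that inspects only final answers would count the "leak" response as safe, which is the blind spot the abstract describes.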

Code & Models

Datasets

HJH2CMD/lrm-safety-eval
dataset · 47 downloads
