Red Teaming Large Reasoning Models
Jiawei Chen, Yang Yang, Chao Yu, Yu Tian, Zhi Cao, Xue Yang, Linghao Li, Hang Su, Zhaoxia Yin

TL;DR
This paper introduces RT-LRM, a comprehensive benchmark for evaluating the trustworthiness of large reasoning models across truthfulness, safety, and efficiency, revealing their vulnerabilities and guiding future improvements.
Contribution
The paper presents a unified benchmark and analytical framework for assessing and understanding the trustworthiness of large reasoning models, built on a curated suite of 30 reasoning tasks.
Findings
LRMs are generally less trustworthy and more fragile than LLMs when exposed to reasoning-specific risks such as CoT-hijacking and prompt-induced inefficiencies.
The benchmark reveals underexplored vulnerabilities in LRM safety and reliability.
A scalable toolbox for standardized trustworthiness evaluation is released.
Abstract
Large Reasoning Models (LRMs) have emerged as a powerful advancement in multi-step reasoning tasks, offering enhanced transparency and logical consistency through explicit chains of thought (CoT). However, these models introduce novel safety and reliability risks, such as CoT-hijacking and prompt-induced inefficiencies, which are not fully captured by existing evaluation methods. To address this gap, we propose RT-LRM, a unified benchmark designed to assess the trustworthiness of LRMs. RT-LRM evaluates three core dimensions: truthfulness, safety and efficiency. Beyond metric-based evaluation, we further introduce the training paradigm as a key analytical perspective to investigate the systematic impact of different training strategies on model trustworthiness. We achieve this by designing a curated suite of 30 reasoning tasks from an observational standpoint. We conduct extensive…
