Red Teaming Large Reasoning Models

Jiawei Chen; Yang Yang; Chao Yu; Yu Tian; Zhi Cao; Xue Yang; Linghao Li; Hang Su; Zhaoxia Yin

arXiv:2512.00412 · cs.CR · April 15, 2026

TL;DR

This paper introduces RT-LRM, a comprehensive benchmark for evaluating the trustworthiness of large reasoning models across truthfulness, safety, and efficiency, revealing their vulnerabilities and guiding future improvements.

Contribution

It presents a new unified benchmark and analytical framework for assessing and understanding the trustworthiness of large reasoning models, including a curated suite of reasoning tasks.

Findings

1. LRMs are generally less trustworthy and more fragile than LLMs under reasoning risks.

2. The benchmark reveals underexplored vulnerabilities in LRM safety and reliability.

3. A scalable toolbox for standardized trustworthiness evaluation is released.

Abstract

Large Reasoning Models (LRMs) have emerged as a powerful advancement in multi-step reasoning tasks, offering enhanced transparency and logical consistency through explicit chains of thought (CoT). However, these models introduce novel safety and reliability risks, such as CoT-hijacking and prompt-induced inefficiencies, which are not fully captured by existing evaluation methods. To address this gap, we propose RT-LRM, a unified benchmark designed to assess the trustworthiness of LRMs. RT-LRM evaluates three core dimensions: truthfulness, safety and efficiency. Beyond metric-based evaluation, we further introduce the training paradigm as a key analytical perspective to investigate the systematic impact of different training strategies on model trustworthiness. We achieve this by designing a curated suite of 30 reasoning tasks from an observational standpoint. We conduct extensive…
