Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories

Dongcheng Zhang; Yi Zhang; Yuxin Chen; An Zhang; Xiang Wang; Chaochao Lu

arXiv:2605.08936 · cs.AI · May 12, 2026

TL;DR

Self-ReSET introduces a reinforcement learning framework that improves large reasoning models' ability to recover from unsafe trajectories, especially under adversarial attacks, by reusing their own failure states for training.

Contribution

It presents a pure reinforcement learning approach that lets models self-recover from their own unsafe reasoning trajectories, addressing the mismatch between static training data and the model's on-policy reasoning traces.

Findings

01

Significantly improves robustness against adversarial attacks.

02

Enhances model recovery from unsafe intermediate states.

03

Maintains general utility while increasing safety.

Abstract

Large Reasoning Models (LRMs) possess remarkable capabilities for self-correction in general domains; however, they frequently struggle to recover from unsafe reasoning trajectories under adversarial attacks. Existing alignment methods attempt to mitigate this vulnerability by fine-tuning the model on expert data, including reflection traces or adversarial prefixes. Crucially, these approaches are often hindered by static training data, which inevitably deviates from the model's dynamic, on-policy reasoning traces; as a result, the model can hardly cover its vast generation space or learn to recover from its own failures. To bridge this gap, we propose Self-ReSET, a pure reinforcement learning framework designed to equip LRMs with the intrinsic capacity to recover from their own safety error trajectories, which are subsequently reused as initial states for reinforcement learning. Extensive…
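
The abstract states that Self-ReSET reuses the model's own unsafe trajectories as initial states for reinforcement learning, but it does not spell out the mechanics. The Python sketch below illustrates one plausible reading under stated assumptions: a rollout is a list of reasoning steps, a hypothetical `is_unsafe_step` checker (a real system would use a safety classifier or judge) flags the first unsafe step, and the prefix up to that step becomes a new starting state that is mixed with fresh prompts at a hypothetical `recovery_ratio`. The names `Rollout`, `build_recovery_states`, and `mix_initial_states` are illustrative and do not come from the paper or the Ing1024/Self-ReSET repository.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    prompt: str
    steps: List[str]  # reasoning steps generated by the current (on-policy) model


def build_recovery_states(
    rollouts: List[Rollout], is_unsafe_step: Callable[[str], bool]
) -> List[str]:
    """Hypothetical: truncate each unsafe rollout right after its first unsafe step.

    The prefix (prompt plus the model's own failing reasoning) is kept as an
    initial state from which the policy must learn to recover.
    """
    states = []
    for r in rollouts:
        for i, step in enumerate(r.steps):
            if is_unsafe_step(step):
                states.append("\n".join([r.prompt] + r.steps[: i + 1]))
                break  # keep only the first failure point of each rollout
    return states


def mix_initial_states(
    prompts: List[str],
    recovery_states: List[str],
    batch_size: int,
    recovery_ratio: float,  # hypothetical share of the batch drawn from failure states
    rng: random.Random,
) -> List[str]:
    """Hypothetical: build an RL batch mixing fresh prompts with recovery states."""
    n_rec = min(len(recovery_states), int(batch_size * recovery_ratio))
    batch = rng.sample(recovery_states, n_rec)  # distinct recovery states
    batch += [rng.choice(prompts) for _ in range(batch_size - n_rec)]
    rng.shuffle(batch)
    return batch
```

For example, with a toy checker that flags any step containing "UNSAFE", the rollout `["think", "UNSAFE plan", "more"]` for prompt `"Q1"` yields the single recovery state `"Q1\nthink\nUNSAFE plan"`. The policy would then be rewarded, by some safety reward the abstract does not specify, for continuing from that prefix to a safe answer.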

Code & Models

Repositories

Ing1024/Self-ReSET (GitHub)