TL;DR
Self-ReSET introduces a reinforcement learning framework that improves large reasoning models' ability to recover from unsafe trajectories, especially under adversarial attacks, by reusing their own failure states for training.
Contribution
The paper presents a reinforcement learning approach that lets models recover from their own unsafe reasoning errors, addressing the limits of static training data.
Findings
Significantly improves robustness against adversarial attacks.
Enhances model recovery from unsafe intermediate states.
Maintains general utility while increasing safety.
Abstract
Large Reasoning Models (LRMs) possess remarkable capabilities for self-correction in general domains; however, they frequently struggle to recover from unsafe reasoning trajectories under adversarial attacks. Existing alignment methods attempt to mitigate this vulnerability by fine-tuning the model on expert data that includes reflection traces or adversarial prefixes. Crucially, these approaches are often hindered by static training data, which inevitably deviates from the model's dynamic, on-policy reasoning traces; as a result, the model hardly covers its vast generation space and does not learn to recover from its own failures. To bridge this gap, we propose Self-ReSET, a pure reinforcement learning framework designed to equip LRMs with the intrinsic capacity to recover from their own safety error trajectories, which are subsequently reused as initial states for reinforcement learning. Extensive…
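The idea of reusing the model's own failed trajectories as initial states can be sketched with a toy data-collection loop. This is a minimal illustration, not the paper's implementation: the policy (`rollout`), the safety judge (`is_safe`), the truncation rule (`failure_prefix`), the binary reward, and the `reuse_prob` parameter are all hypothetical stand-ins, since the summary does not specify them. The policy update itself is omitted.

```python
import random
from collections import deque

UNSAFE = "unsafe_step"  # hypothetical marker for an unsafe reasoning step

def rollout(prefix, rng, new_steps=3):
    """Toy stand-in for the policy: extend `prefix` with random reasoning steps."""
    steps = list(prefix)
    for _ in range(new_steps):
        steps.append(rng.choice(["safe_step", UNSAFE, "reflect"]))
    return steps

def is_safe(steps):
    """Toy judge (assumption): a trajectory is safe if it does not end on an unsafe step."""
    return steps[-1] != UNSAFE

def failure_prefix(steps):
    """Cut a failed trajectory just after its first unsafe step (assumed truncation rule)."""
    cut = steps.index(UNSAFE) + 1
    return steps[:cut]

def collect_batch(buffer, rng, batch_size=8, reuse_prob=0.5):
    """Roll out from fresh or reused-failure initial states and score each result."""
    batch = []
    for _ in range(batch_size):
        # With some probability, restart from one of the model's own failure states.
        reuse = bool(buffer) and rng.random() < reuse_prob
        prefix = rng.choice(list(buffer)) if reuse else []
        steps = rollout(prefix, rng)
        reward = 1.0 if is_safe(steps) else 0.0  # reward recovery, penalize failure
        if reward == 0.0:
            # Store the on-policy failure so later rollouts must recover from it.
            buffer.append(failure_prefix(steps))
        batch.append((prefix, steps, reward))
    return batch

if __name__ == "__main__":
    rng = random.Random(0)
    buffer = deque(maxlen=32)
    for _ in range(5):
        batch = collect_batch(buffer, rng)
    print(len(buffer), "failure states stored")
```

Because the buffer is filled from the policy's own rollouts rather than a fixed dataset, the recovery states track whatever the current policy actually gets wrong, which is the contrast with static training data that the abstract draws.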
