Finding and Reactivating Post-Trained LLMs' Hidden Safety Mechanisms

Mingjie Li; Wai Man Si; Michael Backes; Yang Zhang; Yisen Wang

arXiv:2604.00012 · cs.CL · April 2, 2026

TL;DR

This paper investigates why post-training reduces LLM safety and introduces SafeReAct, a lightweight method to restore safety mechanisms without harming reasoning abilities.

Contribution

The paper shows that post-training can mask the base LLM's original safety mechanisms rather than remove them, and proposes SafeReAct, a lightweight method for reactivating them.

Findings

1. SafeReAct significantly improves safety on harmful prompts.
2. The method does not compromise reasoning performance.
3. Safety mechanisms still exist in post-trained LRMs.

Abstract

Despite the impressive performance of general-purpose large language models (LLMs), they often require fine-tuning or post-training to excel at specific tasks. For instance, large reasoning models (LRMs), such as the DeepSeek-R1 series, demonstrate strong reasoning capabilities after post-training different general large language models on diverse chain-of-thought (CoT) datasets. However, this additional training frequently comes at the cost of reduced safety, as the fine-tuned or post-trained models tend to exhibit more harmful behaviors compared with the regular LLMs before post-training or fine-tuning, potentially leading to harmful outcomes due to their enhanced capabilities. Taking LRMs as an example, we first investigate the underlying cause of this safety degradation in this paper. Our analysis reveals that post-training can mask the original safety mechanisms of the base LLM,…

Videos

Finding and Reactivating Post-Trained LLMs' Hidden Safety Mechanisms (SlidesLive)