SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment

Wonje Jeung; Sangyeon Yoon; Minsuk Kahng; Albert No

arXiv:2505.14667 · cs.AI · October 24, 2025

TL;DR

SAFEPATH is a lightweight alignment method that fine-tunes large reasoning models to emit a short safety primer at the start of their reasoning when a prompt is harmful, reducing harmful outputs and jailbreak success while maintaining reasoning performance.

Contribution

Introduces SAFEPATH, a fine-tuning approach that trains reasoning models to begin their reasoning with an 8-token Safety Primer, enhancing safety with minimal performance trade-offs.

Findings

01. Reduces harmful responses by up to 90%

02. Blocks 83.3% of jailbreak attempts

03. Requires significantly less compute than existing methods

Abstract

Large Reasoning Models (LRMs) have become powerful tools for complex problem solving, but their structured reasoning pathways can lead to unsafe outputs when exposed to harmful prompts. Existing safety alignment methods reduce harmful outputs but can degrade reasoning depth, leading to significant trade-offs in complex, multi-step tasks, and remain vulnerable to sophisticated jailbreak attacks. To address this, we introduce SAFEPATH, a lightweight alignment method that fine-tunes LRMs to emit a short, 8-token Safety Primer at the start of their reasoning, in response to harmful prompts, while leaving the rest of the reasoning process unsupervised. Empirical results across multiple benchmarks indicate that SAFEPATH effectively reduces harmful outputs while maintaining reasoning performance. Specifically, SAFEPATH reduces harmful responses by up to 90.0% and blocks 83.3% of jailbreak…
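The training signal described in the abstract (supervise only the short Safety Primer, leave the rest of the reasoning unsupervised) can be illustrated with a label-masking sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the token ids are made up, the helper names `build_primer_example` and `supervised_positions` are hypothetical, and the use of -100 as the ignored label follows the common cross-entropy convention (it is the default `ignore_index` in PyTorch) rather than anything stated on this page.

```python
from typing import List, Tuple

# Label value that cross-entropy losses conventionally skip
# (PyTorch's default ignore_index). Assumed here, not taken from the paper.
IGNORE_INDEX = -100


def build_primer_example(
    prompt_ids: List[int], primer_ids: List[int]
) -> Tuple[List[int], List[int]]:
    """Build (input_ids, labels) so that only the primer tokens are supervised.

    The sequence stops right after the primer, so no later reasoning
    token contributes to the loss.
    """
    input_ids = list(prompt_ids) + list(primer_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(primer_ids)
    return input_ids, labels


def supervised_positions(labels: List[int]) -> List[int]:
    """Return the indices whose labels would contribute to the loss."""
    return [i for i, tok in enumerate(labels) if tok != IGNORE_INDEX]


# Hypothetical token ids: a harmful prompt followed by a reasoning-start tag,
# then an 8-token primer. Real ids depend on the tokenizer in use.
prompt_ids = [101, 7, 42, 9, 13]
primer_ids = [201, 202, 203, 204, 205, 206, 207, 208]

input_ids, labels = build_primer_example(prompt_ids, primer_ids)
print(supervised_positions(labels))
```

In this sketch the prompt positions are masked and the example ends at the last primer token, which is one way to read "leaving the rest of the reasoning process unsupervised": the model is pushed toward starting with the primer and is otherwise free to continue reasoning as before.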
