SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment
Wonje Jeung, Sangyeon Yoon, Minsuk Kahng, Albert No

TL;DR
SAFEPATH is a lightweight alignment method that fine-tunes large reasoning models to open their reasoning with a short safety primer when a prompt is harmful, reducing harmful outputs and jailbreak success while maintaining reasoning performance.
Contribution
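Introduces SAFEPATH, a fine-tuning approach that trains a reasoning model to emit an 8-token Safety Primer at the start of its reasoning on harmful prompts and leaves the rest of the reasoning unsupervised, improving safety while maintaining reasoning performance.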
Findings
Reduces harmful responses by up to 90.0%
Blocks 83.3% of jailbreak attempts
Requires significantly less compute than existing methods
Abstract
Large Reasoning Models (LRMs) have become powerful tools for complex problem solving, but their structured reasoning pathways can lead to unsafe outputs when exposed to harmful prompts. Existing safety alignment methods reduce harmful outputs but can degrade reasoning depth, leading to significant trade-offs in complex, multi-step tasks, and remain vulnerable to sophisticated jailbreak attacks. To address this, we introduce SAFEPATH, a lightweight alignment method that fine-tunes LRMs to emit a short, 8-token Safety Primer at the start of their reasoning, in response to harmful prompts, while leaving the rest of the reasoning process unsupervised. Empirical results across multiple benchmarks indicate that SAFEPATH effectively reduces harmful outputs while maintaining reasoning performance. Specifically, SAFEPATH reduces harmful responses by up to 90.0% and blocks 83.3% of jailbreak…
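The training signal described in the abstract (supervise only a short primer at the start of reasoning, leave everything after it unsupervised) can be illustrated with label masking. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `build_primer_example`, the reasoning-start marker `think_open_id`, and all token ids are hypothetical placeholders, and the actual primer text is not given on this page. The value -100 is used because it is the default ignore index of common cross-entropy loss implementations such as PyTorch's.

```python
IGNORE_INDEX = -100  # label value skipped by typical cross-entropy losses


def build_primer_example(prompt_ids, think_open_id, primer_ids):
    """Return (input_ids, labels) in which only the primer tokens carry loss.

    The sequence ends right after the primer, so nothing that the model
    would generate afterwards is supervised in this sketch.
    """
    # Prompt tokens plus the (assumed) reasoning-start marker are context only.
    prefix = list(prompt_ids) + [think_open_id]
    input_ids = prefix + list(primer_ids)
    # Mask the context; keep the primer ids as training targets.
    labels = [IGNORE_INDEX] * len(prefix) + list(primer_ids)
    return input_ids, labels


# Hypothetical token ids, for illustration only.
prompt_ids = [101, 102, 103]                   # a harmful prompt, tokenized
think_open_id = 900                            # assumed reasoning-start marker
primer_ids = [11, 12, 13, 14, 15, 16, 17, 18]  # 8 placeholder primer tokens

input_ids, labels = build_primer_example(prompt_ids, think_open_id, primer_ids)
supervised = [t for t in labels if t != IGNORE_INDEX]
```

In this sketch the loss touches exactly eight positions per harmful prompt, which shows the general shape of a primer-only objective. How the authors actually build their training data is not described on this page.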
Taxonomy
Topics: Cognitive Abilities and Testing · Child and Animal Learning Development · Cognitive Science and Mapping
Methods: Attention Is All You Need · Softmax · Depthwise Convolution · Squared ReLU · Multi-DConv-Head Attention · Dense Connections · Primer
