From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training

Yuan Yuan; Tina Sriskandarajah; Anna-Luisa Brakman; Alec Helyar; Alex Beutel; Andrea Vallone; Saachi Jain

arXiv:2508.09224 · cs.CY · August 14, 2025

TL;DR

This paper introduces safe-completions, a safety-training approach for large language models that centers on the safety of the model's output rather than a binary comply-or-refuse decision about user intent. The approach improves both safety and helpfulness, especially on dual-use prompts.

Contribution

The paper proposes safe-completions as an output-centric safety-training method, describes its integration into GPT-5, and shows that it improves on traditional refusal-based training.

Findings

01. Improves safety on dual-use prompts

02. Reduces severity of safety failures

03. Increases model helpfulness

Abstract

Large Language Models used in ChatGPT have traditionally been trained to learn a refusal boundary: depending on the user's intent, the model is taught to either fully comply or outright refuse. While this is a strong mitigation for explicitly malicious prompts, focusing safety training on refusals can lead to brittleness for prompts with obscured user intent. Binary refusal boundaries are especially ill-suited for dual-use cases (such as biology or cybersecurity), where a user request can be answered safely at a high level, but in some cases can lead to malicious uplift if sufficiently detailed or actionable. As an alternative, we propose safe-completions: a safety-training approach that centers on the safety of the assistant's output, rather than a binary classification of the user's intent. Safe-completions seek to maximize helpfulness within the safety policy's constraints. We…
