From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training
Yuan Yuan, Tina Sriskandarajah, Anna-Luisa Brakman, Alec Helyar, Alex Beutel, Andrea Vallone, Saachi Jain

TL;DR
This paper introduces safe-completions, a safety-training approach for large language models that targets the safety of the model's output rather than a binary comply-or-refuse decision, improving both safety and helpfulness, especially on dual-use prompts.
Contribution
The paper proposes safe-completions, an output-centric safety-training method, describes its integration into GPT-5, and shows that it is more effective than traditional refusal-based approaches.
Findings
Improves safety on dual-use prompts
Reduces severity of safety failures
Increases model helpfulness
Abstract
Large Language Models used in ChatGPT have traditionally been trained to learn a refusal boundary: depending on the user's intent, the model is taught to either fully comply or outright refuse. While this is a strong mitigation for explicitly malicious prompts, focusing safety training on refusals can lead to brittleness for prompts with obscured user intent. Binary refusal boundaries are especially ill-suited for dual-use cases (such as biology or cybersecurity), where a user request can be answered safely at a high level, but in some cases can lead to malicious uplift if sufficiently detailed or actionable. As an alternative, we propose safe-completions: a safety-training approach that centers on the safety of the assistant's output, rather than a binary classification of the user's intent. Safe-completions seek to maximize helpfulness within the safety policy's constraints. We…
