Saffron-1: Safety Inference Scaling

Ruizhong Qiu; Gaotang Li; Tianxin Wei; Jingrui He; Hanghang Tong

arXiv:2506.06444 · cs.LG · July 10, 2025

TL;DR

This paper introduces Saffron-1, an inference scaling approach for large language model (LLM) safety. It addresses the inefficiency of conventional inference scaling methods in safety contexts with a multifurcation reward model (MRM) that reduces the number of reward model evaluations needed during inference.

Contribution

It pioneers inference scaling for LLM safety, introducing a multifurcation reward model and strategies to improve safety evaluation efficiency during inference.

Findings

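01. Conventional inference scaling techniques perform poorly in safety contexts, even falling short of basic approaches like Best-of-N Sampling.

02. The proposed multifurcation reward model (MRM) significantly reduces the number of reward model evaluations.

03. Extensive experiments validate the effectiveness of Saffron-1.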

Abstract

Existing safety assurance research has primarily focused on training-phase alignment to instill safe behaviors into LLMs. However, recent studies have exposed these methods' susceptibility to diverse jailbreak attacks. Concurrently, inference scaling has significantly advanced LLM reasoning capabilities but remains unexplored in the context of safety assurance. Addressing this gap, our work pioneers inference scaling for robust and effective LLM safety against emerging threats. We reveal that conventional inference scaling techniques, despite their success in reasoning tasks, perform poorly in safety contexts, even falling short of basic approaches like Best-of-N Sampling. We attribute this inefficiency to a newly identified challenge, the exploration–efficiency dilemma, arising from the high computational overhead associated with frequent process reward model (PRM) evaluations. To…
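For readers unfamiliar with the Best-of-N Sampling baseline that the abstract mentions, the following is a minimal, generic sketch of the idea: sample N candidate responses, score each once with a reward function, and return the highest-scoring one. It is not the Saffron-1 method or code from the q-rz/saffron repository. The names `best_of_n`, `toy_generate`, and `toy_safety_reward` are hypothetical, and the toy generator and keyword-based reward stand in for a real LLM and a real safety reward model.

```python
import random
from typing import Callable, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str, random.Random], str],
    reward: Callable[[str, str], float],
    n: int,
    seed: int = 0,
) -> Tuple[str, float, int]:
    """Generic Best-of-N: sample n responses, score each once, keep the best.

    Returns (best_response, best_score, number_of_reward_evaluations).
    """
    rng = random.Random(seed)
    # Sample n complete candidate responses for the same prompt.
    candidates = [generate(prompt, rng) for _ in range(n)]
    # One reward evaluation per complete response, so exactly n calls in total.
    scores = [reward(prompt, c) for c in candidates]
    # Index of the highest score; ties resolve to the earliest candidate.
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best], len(scores)


# Hypothetical stand-in for an LLM sampler (assumption, for illustration only).
def toy_generate(prompt: str, rng: random.Random) -> str:
    return rng.choice([
        "Sure, here is how to do that.",
        "I cannot help with that request.",
    ])


# Hypothetical stand-in for a safety reward model: rewards refusals.
def toy_safety_reward(prompt: str, response: str) -> float:
    return 1.0 if "cannot help" in response else 0.0


if __name__ == "__main__":
    response, score, n_evals = best_of_n(
        "some adversarial prompt", toy_generate, toy_safety_reward, n=8
    )
    print(response, score, n_evals)
```

In this sketch the reward function is called once per complete response, so the number of reward evaluations equals N. Search procedures that score partial responses with a process reward model at every step make more such calls, which is the overhead the abstract refers to.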

Code & Models

Repositories

q-rz/saffron (PyTorch, official)

Taxonomy

Topics: Safety Systems Engineering in Autonomy · Adversarial Robustness in Machine Learning · Information and Cyber Security