Saffron-1: Safety Inference Scaling
Ruizhong Qiu, Gaotang Li, Tianxin Wei, Jingrui He, Hanghang Tong

TL;DR
This paper introduces Saffron-1, an inference scaling approach for large language model (LLM) safety. It addresses the inefficiency of conventional inference scaling methods in safety settings and proposes a multifurcation reward model (MRM) to make safety more robust at inference time.
Contribution
The paper pioneers inference scaling for LLM safety, introducing a multifurcation reward model (MRM) and strategies that make safety evaluation more efficient during inference.
Findings
Conventional inference scaling performs poorly in safety contexts.
The proposed multifurcation reward model (MRM) significantly reduces the number of reward model evaluations.
Extensive experiments validate the effectiveness of Saffron-1.
Abstract
Existing safety assurance research has primarily focused on training-phase alignment to instill safe behaviors into LLMs. However, recent studies have exposed these methods' susceptibility to diverse jailbreak attacks. Concurrently, inference scaling has significantly advanced LLM reasoning capabilities but remains unexplored in the context of safety assurance. Addressing this gap, our work pioneers inference scaling for robust and effective LLM safety against emerging threats. We reveal that conventional inference scaling techniques, despite their success in reasoning tasks, perform poorly in safety contexts, even falling short of basic approaches like Best-of-N Sampling. We attribute this inefficiency to a newly identified challenge, the exploration–efficiency dilemma, arising from the high computational overhead associated with frequent process reward model (PRM) evaluations. To…
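For readers unfamiliar with the Best-of-N Sampling baseline named in the abstract, the following is a minimal sketch of the general technique, not the authors' implementation. The names `best_of_n`, `toy_generate`, and `toy_safety_reward` are hypothetical stand-ins: a real setup would replace them with a policy LLM and a learned safety reward model.

```python
import random
from typing import Callable, List


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward: Callable[[str, str], float],
              n: int = 4) -> str:
    """Sample n complete responses and return the one with the highest reward."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    # One reward call per complete response, so n reward evaluations in total.
    return max(candidates, key=lambda response: reward(prompt, response))


# Hypothetical stand-ins (assumptions, not from the paper).
_RESPONSES = ["Sure, here is how to do that.", "I can't help with that request."]
_rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    # Stand-in for sampling a response from a policy LLM.
    return _rng.choice(_RESPONSES)


def toy_safety_reward(prompt: str, response: str) -> float:
    # Stand-in for a safety reward model: refusals score higher.
    return 1.0 if response.startswith("I can't") else 0.0


if __name__ == "__main__":
    print(best_of_n("example prompt", toy_generate, toy_safety_reward, n=4))
```

Best-of-N scores only finished responses, whereas the process reward model (PRM) evaluations the abstract refers to score intermediate steps, which is where the frequent-evaluation overhead comes from.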
Taxonomy
Topics: Safety Systems Engineering in Autonomy · Adversarial Robustness in Machine Learning · Information and Cyber Security
