Speculative Safety-Aware Decoding

Xuekang Wang; Shengyu Zhu; Xueqi Cheng

arXiv:2508.17739 · cs.LG · September 30, 2025

TL;DR

This paper introduces Speculative Safety-Aware Decoding (SSD), a lightweight decoding method that enhances large language models with safety properties, accelerates inference, and dynamically balances safety and utility during decoding.

Contribution

SSD is a decoding-time approach that combines speculative sampling with a small safety-aligned model to improve both the safety and the inference efficiency of large language models.

Findings

01. SSD successfully adds safety properties to large models.

02. SSD accelerates inference compared to traditional methods.

03. SSD maintains helpfulness on benign queries.

Abstract

Despite extensive efforts to align Large Language Models (LLMs) with human values and safety rules, jailbreak attacks that exploit certain vulnerabilities continuously emerge, highlighting the need to strengthen existing LLMs with additional safety properties to defend against these attacks. However, tuning large models has become increasingly resource intensive and may have difficulty ensuring consistent performance. We introduce Speculative Safety-Aware Decoding (SSD), a lightweight decoding-time approach that equips LLMs with the desired safety property while accelerating inference. We assume that there exists a small language model that possesses this desired property. SSD integrates speculative sampling during decoding and leverages the match ratio between the small and composite models to quantify jailbreak risks. This enables SSD to dynamically switch between decoding schemes to…
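The abstract mentions speculative sampling and a match ratio used to switch between decoding schemes, so a small illustration may help. The sketch below is not the authors' implementation. It uses the standard speculative-sampling acceptance test (accept a drafted token x with probability min(1, p(x)/q(x))) and treats the fraction of accepted draft tokens as the match ratio. The function names, the 0.5 threshold, the scheme labels "utility" and "safety", and the rule that a low match ratio signals higher risk are all assumptions made for illustration.

```python
import random
from typing import Dict, List

Dist = Dict[str, float]  # token -> probability


def count_accepted(draft_tokens: List[str],
                   draft_probs: List[Dist],
                   verifier_probs: List[Dist],
                   rng: random.Random) -> int:
    """Standard speculative-sampling acceptance test.

    Each drafted token x is accepted with probability
    min(1, p_verifier(x) / p_draft(x)); checking stops at the first rejection.
    """
    accepted = 0
    for tok, q, p in zip(draft_tokens, draft_probs, verifier_probs):
        ratio = p.get(tok, 0.0) / q[tok]  # q[tok] > 0 because tok was drafted
        if rng.random() < min(1.0, ratio):
            accepted += 1
        else:
            break
    return accepted


def match_ratio(accepted: int, proposed: int) -> float:
    """Fraction of drafted tokens that the verifier accepted."""
    return accepted / proposed if proposed else 0.0


def choose_scheme(ratio: float, threshold: float = 0.5) -> str:
    """Hypothetical switching rule (an assumption, not from the paper):
    a low match ratio is treated as a sign of elevated jailbreak risk."""
    return "utility" if ratio >= threshold else "safety"


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = ["a", "b", "a", "b"]
    draft = [{"a": 0.5, "b": 0.5}] * 4
    verifier = [{"a": 1.0, "b": 0.0}] * 4  # verifier never emits "b"
    n = count_accepted(tokens, draft, verifier, rng)
    r = match_ratio(n, len(tokens))
    print(n, r, choose_scheme(r))
```

In this toy run the first token is always accepted (probability ratio 2, clipped to 1) and the second is always rejected (verifier probability 0), so one of four drafted tokens is accepted and the hypothetical rule selects the "safety" label.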
