TL;DR
This paper introduces Speculative Safety-Aware Decoding (SSD), a lightweight decoding method that enhances large language models with safety properties, accelerates inference, and dynamically balances safety and utility during decoding.
Contribution
SSD is a novel decoding-time approach that uses speculative sampling and a small safety model to improve the safety and efficiency of large language models.
Findings
SSD successfully adds safety properties to large models.
SSD accelerates inference compared to traditional methods.
SSD maintains helpfulness on benign queries.
Abstract
Despite extensive efforts to align Large Language Models (LLMs) with human values and safety rules, jailbreak attacks that exploit certain vulnerabilities continuously emerge, highlighting the need to strengthen existing LLMs with additional safety properties to defend against these attacks. However, tuning large models has become increasingly resource intensive and may have difficulty ensuring consistent performance. We introduce Speculative Safety-Aware Decoding (SSD), a lightweight decoding-time approach that equips LLMs with the desired safety property while accelerating inference. We assume that there exists a small language model that possesses this desired property. SSD integrates speculative sampling during decoding and leverages the match ratio between the small and composite models to quantify jailbreak risks. This enables SSD to dynamically switch between decoding schemes to…
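To make the match-ratio idea concrete, here is a minimal sketch rather than the authors' implementation. It assumes the standard speculative-sampling acceptance test (accept a drafted token with probability min(1, p/q)) and defines the match ratio as the fraction of drafted tokens that are accepted. The abstract does not give the exact definition, the threshold, or which scheme is chosen at which ratio, so the names `accept_draft_tokens`, `match_ratio`, and `choose_scheme`, the threshold value, and the scheme labels are all hypothetical.

```python
import random
from typing import List, Sequence


def accept_draft_tokens(
    p_target: Sequence[float],
    q_draft: Sequence[float],
    rng: random.Random,
) -> List[bool]:
    """Run the usual speculative-sampling acceptance test on drafted tokens.

    p_target[i]: probability the verifying model assigns to drafted token i.
    q_draft[i]:  probability the small draft model assigned to that token.
    Checking stops at the first rejection, since later drafts are discarded.
    """
    decisions: List[bool] = []
    for p, q in zip(p_target, q_draft):
        accept_prob = min(1.0, p / q) if q > 0 else 0.0
        accepted = rng.random() < accept_prob
        decisions.append(accepted)
        if not accepted:
            break
    return decisions


def match_ratio(decisions: Sequence[bool], num_drafted: int) -> float:
    """Fraction of drafted tokens accepted (assumed definition, not from the paper)."""
    return sum(decisions) / num_drafted if num_drafted else 0.0


def choose_scheme(ratio: float, threshold: float, if_low: str, if_high: str) -> str:
    """Pick a decoding scheme from the match ratio.

    Which scheme belongs on which side of the threshold is left to the caller,
    because the abstract does not state it.
    """
    return if_low if ratio < threshold else if_high


if __name__ == "__main__":
    rng = random.Random(0)
    decisions = accept_draft_tokens([0.5, 0.5, 0.5], [0.5, 0.25, 0.5], rng)
    ratio = match_ratio(decisions, num_drafted=3)
    # The threshold 0.5 and the labels are placeholders for illustration only.
    print(ratio, choose_scheme(ratio, 0.5, "scheme_A", "scheme_B"))
```

The scheme labels are passed in by the caller so the sketch does not commit to a direction (whether a low ratio signals higher or lower jailbreak risk), which the truncated abstract leaves unspecified.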