Matching Ranks Over Probability Yields Truly Deep Safety Alignment
Jason Vega, Gagandeep Singh

TL;DR
This paper reveals vulnerabilities in current safety alignment methods for large language models and proposes a rank-based approach, PRESTO, to significantly improve safety defenses against sophisticated prefilling attacks.
Contribution
It introduces the concept of matching token ranks instead of probabilities for safety alignment and proposes PRESTO, a simple regularization method that enhances safety without harming utility.
Findings
PRESTO improves safety scores by up to 4.7x under attack
Rank-based alignment outperforms probability-based methods
Vulnerabilities exist in current data augmentation defenses
Abstract
A frustratingly easy technique known as the prefilling attack has been shown to effectively circumvent the safety alignment of frontier LLMs by simply prefilling the assistant response with an affirmative prefix before decoding. In response, recent work proposed a supervised fine-tuning (SFT) defense using data augmentation to achieve a "deep" safety alignment, allowing the model to generate natural language refusals immediately following harmful prefills. Unfortunately, we show in this work that the "deep" safety alignment produced by such an approach is in fact not very deep. A generalization of the prefilling attack, which we refer to as the Rank-Assisted Prefilling (RAP) attack, can effectively extract harmful content from models fine-tuned with the data augmentation defense by selecting low-probability "harmful" tokens from the top 20 predicted next tokens at each step…
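The distinction between a token's probability and its rank can be made concrete with a toy next-token distribution. The sketch below is illustrative only and is not the paper's PRESTO method or its attack. The helper names (`softmax`, `token_rank`, `top_k_ids`) and the 50-token vocabulary are assumptions made up for this example. It shows that a token can carry negligible probability mass and still sit well inside the top 20 by rank, which is why constraining probability alone leaves rank unconstrained.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_rank(logits, token_id):
    """1-based rank of token_id when tokens are sorted by descending logit."""
    target = logits[token_id]
    return 1 + sum(1 for x in logits if x > target)

def top_k_ids(logits, k):
    """Indices of the k highest-logit tokens, best first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]

# Hypothetical 50-token vocabulary: token 0 dominates (think of a refusal
# token), while the remaining 49 tokens are nearly tied far below it.
logits = [10.0] + [-0.01 * i for i in range(49)]
probs = softmax(logits)

token = 5  # an arbitrary low-probability token, chosen for illustration
print(f"probability of token {token}: {probs[token]:.2e}")
print(f"rank of token {token}: {token_rank(logits, token)}")
print(f"in top 20: {token in top_k_ids(logits, 20)}")
```

In this toy setup, token 5 receives well under 0.1% of the probability mass but still ranks sixth, so any procedure that inspects the top 20 candidates would see it.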
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Advanced Malware Detection Techniques
