Logit-Gap Steering: A Forward-Pass Diagnostic for Alignment Robustness
Tung-Ling Li, Hongliang Liu

TL;DR
This paper introduces the refusal-affirmation logit gap, a scalar metric for the per-prompt safety margin that alignment gives a language model, and proposes a gradient-free, forward-pass-only method for discovering short suffixes that close this gap, which serves as a diagnostic of alignment robustness.
Contribution
It presents a new scalar metric that quantifies the alignment margin, together with a practical, efficient method for discovering suffixes that transfer across models and improve safety defenses.
Findings
Alignment widens the logit gap on 97.5-99.8% of toxic prompts.
The logit-gap steering method requires about 26,000 forward passes per model family.
Discovered suffixes transfer across models and significantly improve safety metrics.
Abstract
RLHF-style alignment trains language models to refuse unsafe requests, but how much operational margin does this refusal rest on? We introduce the refusal-affirmation logit gap: the difference between the top refusal-token logit and the top affirmative-token logit at the first decoding step. This single scalar quantifies the per-prompt safety margin that alignment provides. Empirically, alignment widens the gap on 97.5-99.8% of toxic prompts across three model families, and median gap closure co-varies with True-ASR ranking across suffix strategies (an internal consistency check, since our method optimises gap closure). To validate the metric's practical significance, we present logit-gap steering, a gradient-free, forward-pass-only method that discovers short in-distribution suffixes (10 tokens per component) whose cumulative effect closes the gap. The method requires…
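The gap defined in the abstract can be illustrated with a minimal sketch. It assumes the first-decoding-step logits are already available as a plain list indexed by token id, and that the refusal and affirmative token sets are given as lists of ids. The function name `logit_gap`, the toy six-token vocabulary, and the id sets below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import Sequence


def logit_gap(
    first_step_logits: Sequence[float],
    refusal_ids: Sequence[int],
    affirmative_ids: Sequence[int],
) -> float:
    """Top refusal-token logit minus top affirmative-token logit.

    A positive value means the strongest refusal token leads at the
    first decoding step; a value at or below zero means the gap is closed.
    """
    top_refusal = max(first_step_logits[i] for i in refusal_ids)
    top_affirmative = max(first_step_logits[i] for i in affirmative_ids)
    return top_refusal - top_affirmative


# Toy first-step logits over a hypothetical 6-token vocabulary.
logits = [0.5, 4.0, 1.5, 2.0, -1.0, 3.0]
REFUSAL_IDS = [1, 2]        # hypothetical ids for refusal tokens
AFFIRMATIVE_IDS = [3, 5]    # hypothetical ids for affirmative tokens

gap = logit_gap(logits, REFUSAL_IDS, AFFIRMATIVE_IDS)
print(gap)  # 4.0 - 3.0 = 1.0, so refusal leads by one logit
```

In this reading, a suffix "closes the gap" when appending it to the prompt drives this value to zero or below. How the token sets are chosen and how suffixes are searched is specified by the paper, not by this sketch.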
