Logit-Gap Steering: A Forward-Pass Diagnostic for Alignment Robustness

Tung-Ling Li; Hongliang Liu

arXiv:2506.24056 · cs.CR · May 5, 2026

TL;DR

This paper introduces the refusal-affirmation logit gap, a single scalar that measures the per-prompt safety margin alignment provides in language models, and proposes a gradient-free, forward-pass-only method that discovers short suffixes whose cumulative effect closes this gap.

Contribution

The paper contributes a scalar metric that quantifies the alignment margin and a practical, efficient suffix-discovery method that transfers across models and improves safety defenses.

Findings

01. Alignment widens the logit gap on 97.5-99.8% of toxic prompts.

02. The logit-gap steering method requires about 26,000 forward passes per model family.

03. Discovered suffixes transfer across models and significantly improve safety metrics.

Abstract

RLHF-style alignment trains language models to refuse unsafe requests, but how much operational margin does this refusal rest on? We introduce the refusal-affirmation logit gap: the difference between the top refusal-token logit and the top affirmative-token logit at the first decoding step. This single scalar quantifies the per-prompt safety margin that alignment provides. Empirically, alignment widens the gap on 97.5-99.8% of toxic prompts across three model families, and median gap closure co-varies with True-ASR ranking across suffix strategies (an internal consistency check, since our method optimises gap closure). To validate the metric's practical significance, we present logit-gap steering, a gradient-free, forward-pass-only method that discovers short in-distribution suffixes (<10 tokens per component) whose cumulative effect closes the gap. The method requires…
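The gap defined in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The token sets `REFUSAL_TOKENS` and `AFFIRM_TOKENS`, the toy logit values, and both function names are assumptions made for illustration. The paper's actual token lists and sign convention may differ. Here a positive gap means the top refusal token leads the top affirmative token.

```python
from typing import Iterable, Mapping, Sequence

# Hypothetical token sets; the paper's actual refusal/affirmation lists are not given here.
REFUSAL_TOKENS = ["I", "Sorry"]
AFFIRM_TOKENS = ["Sure", "Here"]


def refusal_affirmation_gap(
    logits: Mapping[str, float],
    refusal_tokens: Iterable[str],
    affirm_tokens: Iterable[str],
) -> float:
    """Top refusal-token logit minus top affirmative-token logit.

    `logits` holds the model's first-decoding-step logits, keyed by token.
    """
    top_refusal = max(logits[t] for t in refusal_tokens)
    top_affirm = max(logits[t] for t in affirm_tokens)
    return top_refusal - top_affirm


def fraction_widened(base_gaps: Sequence[float], aligned_gaps: Sequence[float]) -> float:
    """Share of prompts whose gap is larger after alignment than before."""
    widened = sum(1 for b, a in zip(base_gaps, aligned_gaps) if a > b)
    return widened / len(base_gaps)


# Toy first-step logits for one prompt (made-up numbers, not model output).
first_step_logits = {"I": 4.0, "Sorry": 6.5, "Sure": 3.0, "Here": 2.5, "The": 1.0}
gap = refusal_affirmation_gap(first_step_logits, REFUSAL_TOKENS, AFFIRM_TOKENS)
print(gap)  # 6.5 - 3.0
```

With a real model, `first_step_logits` would come from a single forward pass over the prompt (plus any candidate suffix), which is why a search over suffixes needs no gradients under this reading of the abstract.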
