Advancing LLM Safe Alignment with Safety Representation Ranking

Tianqi Du; Zeming Wei; Quan Chen; Chenheng Zhang; Yisen Wang

arXiv:2505.15710·cs.CL·May 22, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces Safety Representation Ranking (SRR), a novel method that uses internal LLM states to better evaluate and select safe responses, enhancing robustness against harmful content.

Contribution

The paper presents a new ranking framework that leverages internal model representations for safety assessment, improving upon existing response-based safety evaluation methods.

Findings

01. SRR significantly improves safety robustness against adversarial prompts.

02. SRR uses internal transformer hidden states for safety evaluation.

03. SRR achieves better safety performance across multiple benchmarks.

Abstract

The rapid advancement of large language models (LLMs) has demonstrated milestone success in a variety of tasks, yet their potential for generating harmful content has raised significant safety concerns. Existing safety evaluation approaches typically operate directly on textual responses, overlooking the rich information embedded in the model's internal representations. In this paper, we propose Safety Representation Ranking (SRR), a listwise ranking framework that selects safe responses using hidden states from the LLM itself. SRR encodes both instructions and candidate completions using intermediate transformer representations and ranks candidates via a lightweight similarity-based scorer. Our approach directly leverages internal model states and supervision at the list level to capture subtle safety signals. Experiments across multiple benchmarks show that SRR significantly improves…
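The ranking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine-similarity scorer, the softmax cross-entropy listwise loss, the temperature value, and the random vectors standing in for intermediate hidden states are all assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np


def cosine_scores(h_inst: np.ndarray, h_resps: np.ndarray) -> np.ndarray:
    """Score k candidates (k, d) against one instruction vector (d,).

    Assumption: cosine similarity stands in for the paper's
    "lightweight similarity-based scorer".
    """
    inst = h_inst / (np.linalg.norm(h_inst) + 1e-8)
    resps = h_resps / (np.linalg.norm(h_resps, axis=1, keepdims=True) + 1e-8)
    return resps @ inst


def listwise_loss(scores: np.ndarray, safe_index: int, temperature: float = 0.1) -> float:
    """Softmax cross-entropy over the whole candidate list.

    Assumption: one candidate per list is labeled safe; the loss is low
    when that candidate receives the highest score in the list.
    """
    logits = scores / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[safe_index])


def rank_candidates(h_inst: np.ndarray, h_resps: np.ndarray) -> np.ndarray:
    """Return candidate indices ordered from highest to lowest score."""
    return np.argsort(-cosine_scores(h_inst, h_resps))


# Toy data: random vectors stand in for hidden states taken from an
# intermediate transformer layer (hypothetical, dimension 16).
rng = np.random.default_rng(0)
h_inst = rng.normal(size=16)
h_resps = rng.normal(size=(4, 16))
# Make candidate 2 nearly parallel to the instruction vector so it plays
# the role of the "safe" response in this toy list.
h_resps[2] = h_inst + 0.1 * rng.normal(size=16)

scores = cosine_scores(h_inst, h_resps)
order = rank_candidates(h_inst, h_resps)
loss = listwise_loss(scores, safe_index=2)
```

In practice the vectors would come from the LLM itself and the scorer would be trained on labeled candidate lists; the sketch only shows how list-level supervision and similarity-based ranking fit together.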

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. This paper introduces a lightweight inference-time alignment method for LLMs that does not need heavy training. While ranking responses is not new in itself, using intermediate hidden states is an underexplored approach.
2. The paper is generally well written, and the methodology is well structured and described in detail.

Weaknesses

1. **The motivation of the proposed method is inadequate, which also limits the novelty of the paper.** The paper's central claim that ranking candidate responses using internal representations is superior to using final text is not sufficiently motivated or proven. It states that traditional reward models "may miss fine-grained safety cues embedded in the LLM’s state vectors," but provides no empirical evidence to support this critical claim. The entire motivation for using intermediate represen

Reviewer 02 · Rating 2 · Confidence 4

Strengths

The main strengths of this paper are: 1) The problem this paper studies is important: aligning LLM generation to be safer. 2) The proposed approach is simple and easy to implement. 3) The experiments span three different LLM architectures and a few datasets.

Weaknesses

Despite the paper's strengths, there are major weaknesses that need to be addressed before this paper can be accepted: 1) On the methodology side, I have the following two criticisms: 1a) Using the embedding of the last generated token as an embedding for the entire instruction/response seems incorrect, as it carries too little information about the entire sequence. One should use the embeddings of all instruction/response tokens (perhaps by averaging them). 1b) The

Reviewer 03 · Rating 2 · Confidence 4

Strengths

The Authors introduce an alternative to safe generation by using a safety reranker with multiple response candidates. The SRR is validated on multiple datasets and models. The Authors have demonstrated how SRR performs on benign datasets and bias datasets. The experiments in the article are insufficient to ensure that the SRR would be beneficial in a real-life scenario.

Weaknesses

* The Authors don’t explain in detail how $h_{resp,i}$ is calculated, as the hidden representations of responses have an additional dimension of length in terms of tokens.
* In L163-164, the Authors state: “Since the backbone is trained for next-token prediction, the final layers tend to overfit to this specific task,” but don’t provide any evidence/literature for this statement.
* I see no theoretical or intuitive reason why the usage of a transformer encoder would be beneficial for this encodi

Taxonomy

Topics: Occupational Health and Safety Research · Quality and Management Systems · Risk and Safety Analysis