Where to Steer: Input-Dependent Layer Selection for Steering Improves LLM Alignment

Soham Gadgil; Chris Lin; Su-In Lee

arXiv:2604.03867 · cs.LG · April 7, 2026

TL;DR

This paper introduces Where to Steer (W2S), an adaptive framework that chooses the LLM layer at which to apply steering separately for each input. It improves alignment performance over steering methods that use a single fixed layer.

Contribution

The paper proposes an input-dependent layer selection method for steering vectors, addressing the limitation of fixed-layer interventions in LLM alignment.

Findings

1. W2S outperforms fixed-layer baselines in various settings.
2. Optimal steering layers vary substantially across inputs.
3. Input-dependent control enhances LLM alignment effectiveness.
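The second and third findings can be made concrete with a toy calculation. The sketch assumes a made-up table `scores[i][l]`, the alignment score obtained when input `i` is steered at layer `l`. It compares the best single layer shared by all inputs with a per-input choice. The per-input choice here is an oracle that reads the whole table. It is not the W2S selection mechanism, which the truncated abstract does not describe. The function names are hypothetical.

```python
# Hypothetical alignment scores: rows are inputs, columns are candidate layers.
scores = [
    [0.2, 0.9, 0.4, 0.1],  # input 0 responds best to steering at layer 1
    [0.7, 0.3, 0.2, 0.1],  # input 1 responds best at layer 0
    [0.1, 0.2, 0.3, 0.8],  # input 2 responds best at layer 3
]


def best_fixed_layer(scores):
    """Pick the one layer with the highest mean score across all inputs."""
    n_inputs, n_layers = len(scores), len(scores[0])
    means = [sum(row[l] for row in scores) / n_inputs for l in range(n_layers)]
    best = max(range(n_layers), key=lambda l: means[l])
    return best, means[best]


def per_input_layers(scores):
    """Oracle: pick the highest-scoring layer separately for each input."""
    choices = [max(range(len(row)), key=lambda l: row[l]) for row in scores]
    mean = sum(row[l] for row, l in zip(scores, choices)) / len(scores)
    return choices, mean


fixed_layer, fixed_mean = best_fixed_layer(scores)
choices, adaptive_mean = per_input_layers(scores)
print(fixed_layer, round(fixed_mean, 3))  # 1 0.467
print(choices, round(adaptive_mean, 3))   # [1, 0, 3] 0.8
```

The oracle only gives an upper bound, because it needs the score of every layer on every input. A practical input-dependent method has to predict a good layer from the input alone.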

Abstract

Steering vectors have emerged as a lightweight and effective approach for aligning large language models (LLMs) at inference time, enabling modulation of model behaviors by shifting LLM representations towards a target behavior. However, existing methods typically apply steering vectors at a globally fixed layer, implicitly assuming that the optimal intervention layer is invariant across inputs. We argue that this assumption is fundamentally limited, as representations relevant to a target behavior can be encoded at different layers depending on the input. Theoretically, we show that different inputs can require steering at different layers to achieve alignment with a desirable model behavior. We also provide empirical evidence that the optimal steering layer varies substantially across inputs in practice. Motivated by these observations, we introduce Where to Steer (W2S), a framework…
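The fixed-layer steering that the abstract criticizes can be sketched on a toy residual stream. Two choices here are assumptions, not details from this page: the steering vector is built as a difference of mean activations between examples that show the target behavior and examples that do not (a common construction), and the "model" is a stack of small random `tanh` blocks. The names `steering_vector` and `forward_with_steering` are hypothetical. The sketch adds `alpha * v` to the hidden state right after a single chosen layer.

```python
import numpy as np


def steering_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    # Assumed construction: mean activation on "behavior present" examples
    # minus mean activation on "behavior absent" examples.
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def forward_with_steering(x, layers, steer_layer=None, v=None, alpha=1.0):
    # Toy residual stream: h <- h + f(h) for each block f.
    # If steer_layer is set, shift the hidden state by alpha * v right after
    # that block. The same layer is used for every input (fixed-layer steering).
    h = np.asarray(x, dtype=float)
    for i, f in enumerate(layers):
        h = h + f(h)
        if steer_layer is not None and i == steer_layer:
            h = h + alpha * v
    return h


# Hypothetical 4-block model with small random weights.
rng = np.random.default_rng(0)
d = 8
weights = [0.1 * rng.standard_normal((d, d)) for _ in range(4)]
layers = [lambda h, W=W: np.tanh(h @ W) for W in weights]  # bind W per block

# Hypothetical activations for contrasting example sets.
pos = rng.standard_normal((16, d)) + 1.0
neg = rng.standard_normal((16, d))
v = steering_vector(pos, neg)

x = rng.standard_normal(d)
base = forward_with_steering(x, layers)
steered = forward_with_steering(x, layers, steer_layer=2, v=v, alpha=4.0)
```

When steering is applied after the last block, the output moves by exactly `alpha * v`. At earlier layers the shift also passes through the later blocks, which is why the choice of layer matters.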
