Self-Improving Model Steering

Rongyi Zhu; Yuhui Wang; Tanqiu Jiang; Jiacheng Liang; Ting Wang

arXiv:2507.08967·cs.CL·July 15, 2025

TL;DR

SIMS introduces a self-improving framework for model steering that eliminates the need for external supervision by autonomously generating and refining contrastive samples, leading to more adaptable and effective alignment of large language models.

Contribution

The paper presents SIMS, the first self-improving model-steering method that operates without external annotations, using iterative self-refinement and novel strategies to enhance LLM alignment.

Findings

1. Outperforms existing steering methods in effectiveness
2. Demonstrates adaptability across diverse LLMs and benchmarks
3. Reduces reliance on external annotated data

Abstract

Model steering represents a powerful technique that dynamically aligns large language models (LLMs) with human preferences during inference. However, conventional model-steering methods rely heavily on externally annotated data, not only limiting their adaptability to varying contexts but also tethering their effectiveness to annotation quality. In this paper, we present SIMS, the first self-improving model-steering framework that operates without relying on external supervision. At its core, SIMS autonomously generates and refines contrastive samples through iterative self-improvement cycles, enabling adaptive, context-specific steering. Additionally, SIMS employs novel strategies, including prompt ranking and contrast sampling, to further enhance steering efficacy. Extensive evaluation across diverse LLMs and benchmarks demonstrates that SIMS substantially outperforms existing methods…
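To make the idea of contrastive steering concrete, here is a toy sketch in plain Python. It is not the SIMS algorithm, which this page does not specify; it swaps in the common mean-difference construction of a steering vector and wraps it in a simple loop. All names (`steering_vector`, `split_contrastive`, `toy_self_improvement`) and parameters are hypothetical, activations are stand-in lists of floats rather than real LLM hidden states, and the `score` function is an assumed placeholder for whatever self-judgment a real system would use.

```python
import random


def mean_vector(vectors):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def steering_vector(preferred, disfavored):
    """Mean-difference steering vector: mean(preferred) - mean(disfavored)."""
    mu_pos = mean_vector(preferred)
    mu_neg = mean_vector(disfavored)
    return [p - n for p, n in zip(mu_pos, mu_neg)]


def steer(hidden, vector, strength=1.0):
    """Shift a hidden state along the steering vector."""
    return [h + strength * v for h, v in zip(hidden, vector)]


def split_contrastive(candidates, score):
    """Rank candidates by score; top half is preferred, bottom half disfavored."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a contrast")
    ranked = sorted(candidates, key=score, reverse=True)
    half = len(ranked) // 2
    return ranked[:half], ranked[len(ranked) - half:]


def toy_self_improvement(hidden, score, rounds=3, k=8, noise=1.0,
                         strength=0.5, seed=0):
    """Toy loop (an assumption, not the paper's procedure): sample k noisy
    candidates around the current state, split them into contrastive sets,
    and steer toward the preferred set. No external labels are used."""
    rng = random.Random(seed)
    for _ in range(rounds):
        candidates = [[h + rng.gauss(0.0, noise) for h in hidden]
                      for _ in range(k)]
        preferred, disfavored = split_contrastive(candidates, score)
        hidden = steer(hidden, steering_vector(preferred, disfavored), strength)
    return hidden


if __name__ == "__main__":
    # Hypothetical scorer: prefer states with a larger first coordinate.
    print(toy_self_improvement([0.0, 0.0], score=lambda v: v[0]))
```

The only supervision in this sketch is the scorer, which would have to come from the model itself for the loop to count as self-improving; how SIMS actually ranks prompts and samples contrasts is not described in the excerpt above.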

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 2·Confidence 3

Strengths

1. The paper introduces a new perspective on inference-time alignment by combining ideas from activation steering and self-improvement loops.
2. The results show consistent improvement across iterations.

Weaknesses

1. The paper frames SIMS as an inference-time alignment method, yet Eq. (5) optimizes the policy $\pi$, implying a parameter-update process rather than a pure activation-level intervention. The method for solving the $\arg\min$ / $\arg\max$ over $\pi$ is never described, leaving the optimization step undefined.
2. Eq. (5) includes a normalization constant $Z_{\pi_t}$ with no analytical form or computational approximation. The absence of this detail makes the main optimization equation incomplete and unre

Reviewer 02·Rating 6·Confidence 3

Strengths

The paper is well written. It proposes a new adaptive model-steering framework that leverages iterative self-improvement cycles and does not rely on external supervision. The experimental results are compelling.

Weaknesses

The main operation in self-improving model steering is to consider K candidate responses, judge them as preferred or disfavored outputs, and then generate contrastive training samples. However, this approach lacks sufficient novelty, and the paper reads like a technical report. A comparison with previous methods should be included to highlight how the method differs from existing approaches. The method utilizes informative question-response pairs to improve sa

Reviewer 03·Rating 4·Confidence 2

Strengths

- **Extensive experiments:** The evaluation is broad, covering multiple datasets and tasks, with comparisons against standard model-steering approaches.
- **Empirical gains:** The proposed self-improving algorithm generally improves upon common model-steering baselines, demonstrating the potential of iterative self-adjustment in steering models.

Weaknesses

- **Inference-time feasibility:** The iterative algorithm may not be practically permissible or efficient at inference time, raising concerns about its real-world applicability. The paper does not clearly specify the additional computational overhead or time cost associated with the self-improvement loop.
- **Marginal improvements on some tasks:** In Figure 9, the gains on the NLP benchmark tasks appear minor, and in some cases, performance seems comparable to or slightly worse than standard m
