TL;DR
SIMS introduces a self-improving framework for model steering that eliminates the need for external supervision by autonomously generating and refining contrastive samples, leading to more adaptable and effective alignment of large language models.
Contribution
The paper presents SIMS, the first self-improving model-steering method that operates without external annotations, using iterative self-refinement and novel strategies to enhance LLM alignment.
Findings
Outperforms existing steering methods in effectiveness
Demonstrates adaptability across diverse LLMs and benchmarks
Reduces reliance on external annotated data
Abstract
Model steering represents a powerful technique that dynamically aligns large language models (LLMs) with human preferences during inference. However, conventional model-steering methods rely heavily on externally annotated data, not only limiting their adaptability to varying contexts but also tethering their effectiveness to annotation quality. In this paper, we present SIMS, the first self-improving model-steering framework that operates without relying on external supervision. At its core, SIMS autonomously generates and refines contrastive samples through iterative self-improvement cycles, enabling adaptive, context-specific steering. Additionally, SIMS employs novel strategies, including prompt ranking and contrast sampling, to further enhance steering efficacy. Extensive evaluation across diverse LLMs and benchmarks demonstrates that SIMS substantially outperforms existing methods…
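The loop described in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: the difference-of-means steering vector is a standard activation-steering construction swapped in for illustration, and every name here (`generate_with`, `score`, `embed`, `self_improve`, the toy model) is hypothetical. It assumes the model scores its own K candidates, keeps the best and worst as a contrastive pair, and recomputes the steering vector from those pairs each round.

```python
import random


def mean_vec(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def steering_vector(pos_acts, neg_acts):
    """Difference of means: preferred activations minus disfavored ones."""
    mean_pos, mean_neg = mean_vec(pos_acts), mean_vec(neg_acts)
    return [p - q for p, q in zip(mean_pos, mean_neg)]


def build_contrastive_pairs(prompts, generate, score, k=4):
    """Sample k candidates per prompt; keep the best and worst as a pair."""
    pairs = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        ranked = sorted(candidates, key=lambda r: score(prompt, r))
        pairs.append((prompt, ranked[-1], ranked[0]))  # (prompt, preferred, disfavored)
    return pairs


def self_improve(prompts, generate_with, score, embed, dim, rounds=3, k=4):
    """Assumed loop: steer, sample, self-judge, rebuild the steering vector."""
    vec = [0.0] * dim  # start unsteered
    for _ in range(rounds):
        pairs = build_contrastive_pairs(
            prompts, lambda p: generate_with(p, vec), score, k
        )
        pos = [embed(p, good) for p, good, _ in pairs]
        neg = [embed(p, bad) for p, _, bad in pairs]
        vec = steering_vector(pos, neg)
    return vec


# Toy stand-ins (hypothetical): a "response" is a 2-D point, the steering
# vector shifts generation, and the self-judged score is the first coordinate.
def toy_generate(prompt, vec):
    return [random.gauss(0, 1) + vec[0], random.gauss(0, 1) + vec[1]]


def toy_score(prompt, response):
    return response[0]


def toy_embed(prompt, response):
    return response


if __name__ == "__main__":
    random.seed(0)
    vec = self_improve(["p1", "p2", "p3"], toy_generate, toy_score, toy_embed, dim=2)
    print(vec)  # first component points toward higher self-judged score
```

In a real model, `embed` would read hidden activations at a chosen layer and `generate_with` would add the vector to those activations during decoding; both are left abstract here because the excerpt does not specify them.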
Peer Reviews
Decision·Submitted to ICLR 2026
1. The paper introduces a new perspective on inference-time alignment by combining ideas from activation steering and self-improvement loops. 2. The results show consistent improvement across iterations.
1. The paper frames SIMS as an inference-time alignment method, yet Eq. (5) optimizes the policy $\pi$, implying a parameter update process rather than a pure activation-level intervention. The method for solving the $\arg\min$ / $\arg\max$ over $\pi$ is never described, leaving the optimization step undefined. 2. Eq. (5) includes a normalization constant $Z_{\pi_t}$ with no analytical form or computational approximation. The absence of this detail makes the main optimization equation incomplete and unre
This paper is well-written. This paper proposes a new adaptive model steering framework, leveraging iterative self-improvement cycles that do not rely on external supervision. The experimental results are compelling.
Regarding the self-improving model steering, the main operation involves considering K candidate responses and judging them through preferred and disfavored outputs. Then, contrastive training samples are generated. However, this approach lacks sufficient novelty and resembles a technical report. A comparison with previous methods should be included to highlight the differences between your method and the existing approaches. The method utilizes informative question-response pairs to improve sa
- **Extensive experiments:** The evaluation is broad, covering multiple datasets and tasks, with comparisons against standard model-steering approaches.
- **Empirical gains:** The proposed self-improving algorithm generally improves upon common model-steering baselines, demonstrating the potential of iterative self-adjustment in steering models.
- **Inference-time feasibility:** The iterative algorithm may not be practically permissible or efficient at inference time, raising concerns about its real-world applicability. The paper does not clearly specify the additional computational overhead or time cost associated with the self-improvement loop.
- **Marginal improvements on some tasks:** In Figure 9, the gains on the NLP benchmark tasks appear minor, and in some cases, performance seems comparable to or slightly worse than standard m
