ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment

Hongjue Zhao; Haosen Sun; Jiangtao Kong; Xiaochang Li; Qineng Wang; Liwei Jiang; Qi Zhu; Tarek Abdelzaher; Yejin Choi; Manling Li; Huajie Shao

arXiv:2602.17560·cs.AI·February 24, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces ODESteer, an ODE-based activation steering framework for LLM alignment that gives existing steering methods a unified theoretical interpretation and reports empirical improvements over them.

Contribution

It provides a unified ODE-based theoretical framework for activation steering and introduces ODESteer, a novel method guided by barrier functions for improved LLM alignment.

Findings

01

Achieves up to 5.7% improvement on TruthfulQA

02

Demonstrates consistent empirical gains over state-of-the-art methods

03

Validates the ODE framework as a principled approach to activation steering

Abstract

Activation steering, or representation engineering, offers a lightweight approach to align large language models (LLMs) by manipulating their internal activations at inference time. However, current methods suffer from two key limitations: (i) the lack of a unified theoretical framework for guiding the design of steering directions, and (ii) an over-reliance on one-step steering that fails to capture complex patterns of activation distributions. In this work, we propose a unified ordinary differential equations (ODEs)-based theoretical framework for activation steering in LLM alignment. We show that conventional activation addition can be interpreted as a first-order approximation to the solution of an ODE. Based on this ODE perspective, identifying a steering direction becomes equivalent to designing a barrier function from control theory. Derived from this framework, we introduce…
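The abstract's claim that activation addition is a first-order approximation to an ODE solution can be illustrated with a minimal sketch. This is not the paper's ODESteer method: the function `euler_steer`, the direction `w`, the point `target`, and the toy vector fields are all hypothetical, chosen only so the arithmetic is easy to follow. The sketch assumes steering is modeled as da/dt = v(a) and integrated with forward Euler.

```python
import numpy as np

def euler_steer(a0, v, T=1.0, n_steps=1):
    """Integrate da/dt = v(a) from t=0 to t=T with forward Euler."""
    a = np.asarray(a0, dtype=float).copy()
    dt = T / n_steps
    for _ in range(n_steps):
        a = a + dt * v(a)  # one first-order (Euler) update
    return a

a0 = np.zeros(3)

# Case 1 (hypothetical): a constant steering direction w.
# A single Euler step of length T is exactly activation addition a0 + T * w.
w = np.array([1.0, -2.0, 0.5])
one_step = euler_steer(a0, lambda a: w, T=4.0, n_steps=1)
addition = a0 + 4.0 * w

# Case 2 (hypothetical): a state-dependent field v(a) = target - a,
# i.e. the gradient of the toy score h(a) = -0.5 * ||a - target||^2.
# Its exact solution at time T from a0 = 0 is target * (1 - exp(-T)).
target = np.array([2.0, 0.0, -1.0])
field = lambda a: target - a
single_step = euler_steer(a0, field, T=1.0, n_steps=1)
multi_step = euler_steer(a0, field, T=1.0, n_steps=10)
exact = target * (1.0 - np.exp(-1.0))

print("one step equals addition:", np.allclose(one_step, addition))
print("single-step error:", np.linalg.norm(single_step - exact))
print("ten-step error:   ", np.linalg.norm(multi_step - exact))
```

For a constant direction the step count makes no difference, which is why one-step addition suffices there. When the field depends on the current activation, as in the second toy case, taking more steps tracks the ODE solution more closely, at the cost of more evaluations of the field per steered activation.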

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The proposed ODE-based multi-step activation steering is well-motivated and demonstrates a meaningful degree of novelty.
2. The experiments cover multiple models and multiple tasks, clearly showing that the proposed method outperforms one-step activation steering baselines.

Weaknesses

1. The proposed method requires multi-step ODE integration, which introduces additional computational overhead during inference. The paper would benefit from a more detailed analysis and discussion of this extra cost.
2. Some implementation details are insufficiently specified. During training, the method relies on collecting positive and negative activations, but the paper does not clearly describe how these activations are extracted, particularly which token positions are used. Additionally,

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The barrier-function perspective is well-grounded in control theory, offering a principled view that unifies existing steering methods.
2. The paper is clearly written, and empirical results are promising, showing consistent gains across multiple models and datasets.

Weaknesses

1. The claimed advantage over output-optimization methods is debatable: both approaches rely on a learned scoring function, and the proposed method's barrier function also requires accurate estimation and introduces additional hyperparameters (e.g., step size, number of ODE steps, solver type, and polynomial sketch settings).
2. It is unclear whether the ODE formulation is necessary. Could similar results be achieved by taking several gradient ascent steps on the barrier function, which might be mor

Reviewer 03 · Rating 2 · Confidence 3

Strengths

1. The paper introduces a theoretically grounded and unified ODE-based framework for activation steering.
2. It enables adaptive, stable, and efficient alignment of LLMs without any retraining.
3. The method shows consistent improvements across truthfulness and safety benchmarks.

Weaknesses

1. **Minimal performance improvement:** The reported gains (≈2–7%) are small and may not be statistically significant.
2. **Lack of clarity on $p_+$ and $p_-$:** The paper doesn't specify how positive and negative activations are categorized or sampled.
3. **Unverified barrier property:** It is not shown that the learned barrier function $h(a)$ satisfies the Proposition's condition $\dot{h}(a)=\nabla h(a)^\top v(a)>0$ in practice.
4. **No trajectory-level likelihood analysis:** The paper doesn't measu

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods