Enhancing Instruction Following of LLMs via Activation Steering with Dynamic Rejection
Minjae Kang, Jaehyung Kim

TL;DR
This paper introduces DIRECTER, a dynamic activation steering method that adaptively modulates model internals to improve instruction-following in LLMs without sacrificing output quality.
Contribution
We propose a novel dynamic steering technique that adjusts activation influence based on plausibility, enhancing instruction-following while avoiding oversteering.
Findings
Improves instruction-following accuracy by up to 6.5%
Effectively mitigates oversteering without degrading output quality
Demonstrates broad applicability across diverse benchmarks
Abstract
Large Language Models (LLMs), despite advances in instruction tuning, often fail to follow complex user instructions. Activation steering techniques aim to mitigate this by manipulating model internals, but they risk oversteering, where excessive emphasis on the instruction degrades task accuracy and overall text quality. To address this, we introduce DIRECTER (Dynamic rejection steering), a novel steering method that dynamically modulates steering strength by scaling the KV cache, without requiring an additional dataset. DIRECTER couples steering with a plausibility-guided decoding loop, which adaptively adjusts steering strength at each step by comparing the steered output distribution to the original. If the steered output is deemed implausible, steering strength is progressively weakened. This strength modulation is guided by a lightweight, one-time attention sensitivity analysis that…
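The rejection loop described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the ratio-style plausibility test and its threshold `beta`, the scalar `strength` that is halved on each rejection, and the hypothetical `steer_fn` that stands in for a steered forward pass (the actual method modulates steering by scaling the KV cache).

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_plausible(p_orig, token, beta):
    """Assumed ratio test: the token must keep at least `beta` times the
    probability of the unsteered model's most likely token."""
    return p_orig[token] >= beta * max(p_orig)


def choose_token(orig_logits, steer_fn, strength=1.0, beta=0.1, max_halvings=4):
    """Pick one token, weakening the steering until the choice is plausible.

    steer_fn(strength) is a hypothetical callable returning steered logits.
    Returns (token_id, strength_actually_used).
    """
    p_orig = softmax(orig_logits)
    for _ in range(max_halvings + 1):
        p_steer = softmax(steer_fn(strength))
        token = max(range(len(p_steer)), key=p_steer.__getitem__)
        if is_plausible(p_orig, token, beta):
            return token, strength
        strength /= 2.0  # reject the candidate and retry with weaker steering
    # Every steered candidate was rejected: fall back to the unsteered model.
    return max(range(len(p_orig)), key=p_orig.__getitem__), 0.0


# Toy three-token vocabulary; the numbers are made up for illustration.
ORIG_LOGITS = [4.0, 3.0, -6.0]
STEER_DIRECTION = [0.0, 2.5, 12.0]


def toy_steer(strength):
    """Stand-in for a steered forward pass: shift logits along a fixed direction."""
    return [o + strength * d for o, d in zip(ORIG_LOGITS, STEER_DIRECTION)]


token, used = choose_token(ORIG_LOGITS, toy_steer)
print(token, used)  # prints "1 0.5"
```

In this toy run, full-strength steering selects token 2, which the unsteered model considers almost impossible, so the candidate is rejected. At half strength the steered choice becomes token 1, which passes the assumed ratio test and is accepted.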
Peer Reviews
Decision · ICLR 2026 Poster
- The paper accurately identifies the "oversteering" problem in existing activation steering techniques. The proposed DIRECTER provides a dynamic, adaptive control mechanism rather than relying on static, manually-tuned hyperparameters.
- The results strongly demonstrate the method's key advantage: it significantly improves instruction following without sacrificing core task accuracy or text quality.
- The evaluation is thorough, using diverse benchmarks including IFEval (strict instructions), L

- RQ2 concerns the generalization of DIRECTER across different architectures and model sizes. To make the comparison more complete, Table 2 should also include results for other steering-based methods, such as PASTA and SpotLight, on more LLM architectures (e.g., Qwen).
- The choice of the probability-ratio–based plausibility criterion (Eq. 2) is currently mainly empirical. However, commonly used measures for comparing probability distributions include KL divergence or JS divergence. The authors s

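The contrast the reviewer draws between a probability-ratio test and divergence measures can be made concrete with a small sketch. It is illustrative only: the exact form of Eq. 2 is not reproduced on this page, so `top_ratio` is an assumed stand-in, and both distributions are invented numbers.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def top_ratio(p_orig, token):
    """Assumed token-level check: probability of `token` relative to the
    unsteered model's most likely token."""
    return p_orig[token] / max(p_orig)


p_orig = [0.70, 0.25, 0.05]   # made-up unsteered distribution
p_steer = [0.10, 0.30, 0.60]  # made-up steered distribution
candidate = max(range(len(p_steer)), key=p_steer.__getitem__)

print("ratio:", top_ratio(p_orig, candidate))
print("KL(steer || orig):", kl_divergence(p_steer, p_orig))
print("JS:", js_divergence(p_steer, p_orig))
```

A ratio test of this kind scores only the single candidate token, whereas KL and JS summarize the shift across the whole vocabulary. KL is asymmetric and unbounded, while JS is symmetric and never exceeds ln 2, which makes a fixed threshold easier to interpret.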
1. The plausibility-guided decoding loop is conceptually simple yet effective. Its dynamic halving mechanism mimics reinforcement learning–style adaptive control but without extra training.
2. The method introduces minimal memory overhead and modest throughput reduction (≈ 16%).
3. The extensive ablation and robustness analyses (Fig. 2–3) convincingly demonstrate the benefits of dynamic steering and the effectiveness of layer ranking.

1. The attention sensitivity metric (Eq. 3–4) is ad hoc and lacks theoretical justification. The “direct” and “propagated” effects are computed via cosine distance differences, but no intuition or derivation is provided. Why does summing cosine-distance deviations across layers accurately capture “influence”?
2. The method section is hard to follow due to poor narrative flow, cross-references, and undefined notation. For example, readers are told what each symbol means only after it is used, e.g., $𝐿_

This work is well written and well scoped, and provides solid empirical evidence for its proposed control methodology. DIRECTER avoids the problem of "over-steering," which is present in other steering methods applied broadly to the input. The authors show that their method is successful across different benchmarks and model scales, and that it can be used in conjunction with many different architectures. Importantly, they show that DIRECTER is efficient and performs well against simpler interven
* The fixed plausibility threshold, while effective, is applied globally based on a hyperparameter sweep on a subset of the data, and the paper does not make clear whether it generalizes across models or data domains. A per-task adaptive value or assessment could help validate the applicability of this parameter.
* The LLM-based evaluation without human validation is a limitation, given the claims of text quality. Reliability of this as a metric should have a subset assessed manually to validate
Taxonomy
Topics: Text Readability and Simplification · Topic Modeling · Big Data and Digital Economy
