Steering Language Models with Weight Arithmetic
Constanza Fierro, Fabien Roger

TL;DR
This paper introduces contrastive weight steering, a simple post-training method that edits language model weights to control behaviors like sycophancy and misalignment, often outperforming activation-based methods.
Contribution
The paper proposes a novel weight arithmetic technique for post-training behavior control in LLMs, enabling better out-of-distribution generalization and mitigation of undesired behaviors.
Findings
Weight steering often outperforms activation steering in behavioral control.
It can mitigate sycophancy and behavioral drift during fine-tuning.
Emergent misalignment may be detectable via weight similarity measures.
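The last finding can be illustrated with a small sketch. It assumes cosine similarity as the similarity measure (this summary does not say which measure the paper uses), and it represents model weights as a toy dictionary of lists. The helper names `flatten`, `cosine_similarity`, and `update_similarity` are hypothetical, chosen for illustration only.

```python
import math

def flatten(params):
    """Concatenate all parameter lists in a fixed (sorted-name) order."""
    return [x for name in sorted(params) for x in params[name]]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def update_similarity(base, finetuned, direction):
    """Compare a fine-tune's weight update (finetuned - base) with a behavior direction."""
    delta = {n: [f - b for f, b in zip(finetuned[n], base[n])] for n in base}
    return cosine_similarity(flatten(delta), flatten(direction))

# Toy example: one parameter tensor with two entries (assumed values).
base = {"w": [0.0, 0.0]}
direction = {"w": [6.0, 8.0]}       # assumed, precomputed behavior direction
aligned_ft = {"w": [3.0, 4.0]}      # update points along the direction
orthogonal_ft = {"w": [-4.0, 3.0]}  # update is perpendicular to it

print(update_similarity(base, aligned_ft, direction))     # 1.0
print(update_similarity(base, orthogonal_ft, direction))  # 0.0
```

Under this assumption, a high similarity between a fine-tune's update and a known undesired direction would be the warning signal; real models would need the comparison done over actual parameter tensors.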
Abstract
Providing high-quality feedback to Large Language Models (LLMs) on a diverse training distribution can be difficult and expensive, and providing feedback only on a narrow distribution can result in unintended generalizations. To better leverage narrow training data, we propose contrastive weight steering, a simple post-training method that edits the model parameters using weight arithmetic. We isolate a behavior direction in weight-space by subtracting the weight deltas from two small fine-tunes -- one that induces the desired behavior and another that induces its opposite -- and then add or remove this direction to modify the model's weights. We apply this technique to mitigate sycophancy and induce misalignment, and find that weight steering often generalizes further than activation steering, achieving stronger out-of-distribution behavioral control before degrading general…
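The weight arithmetic described in the abstract can be sketched on toy data. This is a minimal illustration, not the authors' implementation: weights are plain dictionaries of lists, the values are made up, and the names `behavior_direction`, `steer`, and the coefficient `k` are assumptions introduced here.

```python
def sub(a, b):
    """Element-wise a - b over matching parameter names."""
    return {name: [x - y for x, y in zip(a[name], b[name])] for name in a}

def behavior_direction(base, positive_ft, negative_ft):
    """Difference of the two fine-tuning deltas: (pos - base) - (neg - base).

    The base weights cancel, so this equals pos - neg.
    """
    pos_delta = sub(positive_ft, base)
    neg_delta = sub(negative_ft, base)
    return sub(pos_delta, neg_delta)

def steer(weights, direction, k):
    """Add k times the direction; a negative k removes the behavior instead."""
    return {
        name: [w + k * d for w, d in zip(weights[name], direction[name])]
        for name in weights
    }

# Toy weights (assumed values): one parameter tensor with three entries.
base = {"layer.weight": [1.0, 2.0, 3.0]}
positive_ft = {"layer.weight": [1.5, 2.0, 2.0]}  # fine-tune inducing the behavior
negative_ft = {"layer.weight": [0.5, 2.0, 4.0]}  # fine-tune inducing its opposite

direction = behavior_direction(base, positive_ft, negative_ft)
steered_up = steer(base, direction, k=1.0)
steered_down = steer(base, direction, k=-1.0)

print(direction)     # {'layer.weight': [1.0, 0.0, -2.0]}
print(steered_up)    # {'layer.weight': [2.0, 2.0, 1.0]}
print(steered_down)  # {'layer.weight': [0.0, 2.0, 5.0]}
```

In practice the same arithmetic would run over real parameter tensors, and the steered weights need not start from the base model: the abstract describes adding or removing the direction to modify a model's weights in general.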
Peer Reviews
Decision: ICLR 2026 Poster
1. The contrastive weight-space formulation is both simple and effective.
2. Demonstrated across behaviors (sycophancy, evilness, refusal) and architectures.
3. Outperforms activation steering on unseen distributions.
4. Shows that contrastive weight steering (CWS) can correct sycophancy induced during task-specific fine-tuning without harming core skills.
5. Provides early evidence that misalignment can be detected by tracking weight-space similarities.
6. Hyperparameters, datasets, and prompts are clearly documented.
1. The paper does not formally analyze why certain weight directions correspond to behavioral dimensions.
2. The steering coefficient k is tuned manually; adaptive or learning-based selection could improve reliability.
3. Experiments use models up to 7B parameters; larger frontier models (e.g., 70B+) could test scalability.
4. It remains unclear how interpretable or modular these weight directions are across unrelated behaviors.
5. The "evil vector" similarity experiment is promising but would
1. Methodology: The approach is a conceptually simple yet effective contrastive construction of behavioral directions in weight space, building directly upon and extending well-known task-vector strategies (Ilharco et al., 2023).
2. Clarity with Explicit Comparison: The paper systematically compares contrastive weight steering to activation steering and joint fine-tuning, using diverse alignment-relevant behaviors (sycophancy, refusal, evilness) across several standard LLM architectures (Qwen2.5
1. Oversimplified Method and Ambiguity in Implementation: The proposed method appears conceptually simple, relying on subtracting the negative direction from the positive one and interpolating between them to obtain the final value. However, constructing an appropriate negative direction is often non-trivial and may not be unique (see my Q1), potentially introducing ambiguity in the optimization process. Furthermore, the selection of an appropriate interpolation coefficient k plays a critical ro
- The method is simple & practical: A minimal, reproducible recipe (two tiny LoRA runs + one vector addition) with low data/compute cost.
- OOD robustness vs. activation steering: Under matched data and control strength, weight steering more often preserves base accuracy while shifting behavior.
- Useful byproducts: The learned behavior direction doubles as a monitoring signal for emergent misalignment.
- Limited novelty: Methodologically close to task arithmetic/task vectors; the contrastive construction is natural but incremental.
- Baselines could be stronger: Most comparisons are to activation steering or prompts. To contextualize tradeoffs, it would help to include training-heavier baselines (e.g., larger SFT/RLHF slices) on the same behaviors and report cost-adjusted outcomes.
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
