Depth-Wise Activation Steering for Honest Language Models
Gracjan Góral, Marysia Winkels, Steven Basart

TL;DR
This paper introduces a training-free activation steering method that improves honesty in large language models by weighting steering strength across network depth with a Gaussian schedule. On the MASK benchmark, it improves honesty over no-steering and single-layer baselines in six of seven models.
Contribution
The authors propose a novel Gaussian scheduling approach for activation steering that enhances model honesty without retraining or fine-tuning, applicable across various models.
Findings
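Gaussian scheduling improves honesty over no-steering and single-layer baselines in 6 of 7 models on the MASK benchmark.
How steering strength is distributed across depth matters: in equal-budget ablations, the Gaussian schedule outperforms random, uniform, and box-filter allocations.
The method is simple, model-agnostic, and requires no retraining.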
Abstract
Large language models sometimes assert falsehoods despite internally representing the correct answer, failures of honesty rather than accuracy, which undermines auditability and safety. Existing approaches largely optimize factual correctness or depend on retraining and brittle single-layer edits, offering limited leverage over truthful reporting. We present a training-free activation steering method that weights steering strength across network depth using a Gaussian schedule. On the MASK benchmark, which separates honesty from knowledge, we evaluate seven models spanning the LLaMA, Qwen, and Mistral families and find that Gaussian scheduling improves honesty over no-steering and single-layer baselines in six of seven models. Equal-budget ablations on LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct show the Gaussian schedule outperforms random, uniform, and box-filter depth allocations,…
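The depth-wise weighting described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact parameterization, so everything named here is an assumption made for illustration: the function names, the `center` and `sigma` parameters, the normalization of weights to a fixed total `budget` (chosen to mirror the "equal-budget" comparison), and the additive update `h + w * v`. It is not the authors' implementation.

```python
import math


def gaussian_layer_weights(num_layers, center, sigma, budget=1.0):
    """Per-layer steering weights shaped like a Gaussian over depth.

    Hypothetical parameterization: `center` is the layer index where
    steering is strongest, `sigma` controls the spread, and the weights
    are rescaled so they sum to `budget`.
    """
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    raw = [math.exp(-0.5 * ((layer - center) / sigma) ** 2)
           for layer in range(num_layers)]
    total = sum(raw)
    return [budget * r / total for r in raw]


def uniform_layer_weights(num_layers, budget=1.0):
    """Baseline allocation: the same total budget spread evenly over layers."""
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    return [budget / num_layers] * num_layers


def steer(hidden, direction, weight):
    """Add a scaled steering direction to one layer's hidden state.

    `hidden` and `direction` are plain lists of floats standing in for
    activation vectors; a real model would use tensors and forward hooks.
    """
    return [h + weight * d for h, d in zip(hidden, direction)]
```

Under these assumptions, `gaussian_layer_weights(32, center=16, sigma=4.0, budget=8.0)` concentrates most of the steering strength around the middle layers, while `uniform_layer_weights(32, budget=8.0)` spends the same total evenly, which is the kind of equal-budget comparison the abstract refers to.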
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education · Topic Modeling
