Depth-Wise Activation Steering for Honest Language Models

Gracjan Góral; Marysia Winkels; Steven Basart

arXiv:2512.07667 · cs.LG · December 9, 2025

TL;DR

This paper introduces a training-free activation steering method that improves honesty in large language models by weighting steering strength across network depth with a Gaussian schedule. On the MASK benchmark, it improves honesty over no-steering and single-layer baselines in six of the seven models evaluated.

Contribution

The authors propose a Gaussian depth schedule for activation steering that improves model honesty without retraining or fine-tuning, and evaluate it on seven models from the LLaMA, Qwen, and Mistral families.

Findings

01

Gaussian scheduling improves honesty over no-steering and single-layer baselines in 6 of 7 models on the MASK benchmark.

02

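How steering strength is distributed across depth affects outcomes: in equal-budget ablations on LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct, the Gaussian schedule outperforms random, uniform, and box-filter allocations.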

03

Method is simple, model-agnostic, and requires no retraining.
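
The abstract describes equal-budget ablations against random, uniform, and box-filter depth allocations. The page does not say how the budget is defined, so the following is only a minimal sketch under one assumed reading: every schedule is a vector of non-negative per-layer weights rescaled to the same total. All function names and parameters (`normalize_to_budget`, `center`, `half_width`, `seed`) are hypothetical and are not taken from the paper.

```python
import numpy as np

def normalize_to_budget(weights, budget=1.0):
    """Rescale non-negative per-layer weights so they sum to `budget`.

    Assumption: "equal budget" means equal total steering weight.
    """
    weights = np.asarray(weights, dtype=float)
    return budget * weights / weights.sum()

def uniform_schedule(num_layers, budget=1.0):
    """Spread the budget evenly over every layer."""
    return normalize_to_budget(np.ones(num_layers), budget)

def box_schedule(num_layers, center, half_width, budget=1.0):
    """Spread the budget evenly over a contiguous window of layers."""
    layers = np.arange(num_layers)
    window = (np.abs(layers - center) <= half_width).astype(float)
    return normalize_to_budget(window, budget)

def random_schedule(num_layers, budget=1.0, seed=0):
    """Draw arbitrary positive weights, then rescale to the budget."""
    rng = np.random.default_rng(seed)
    return normalize_to_budget(rng.random(num_layers), budget)
```

Under this reading, any other depth profile, including a Gaussian one, could be passed through the same `normalize_to_budget` helper so that the schedules differ only in where the weight sits, not in how much there is.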

Abstract

Large language models sometimes assert falsehoods despite internally representing the correct answer. These are failures of honesty rather than accuracy, and they undermine auditability and safety. Existing approaches largely optimize factual correctness or depend on retraining and brittle single-layer edits, offering limited leverage over truthful reporting. We present a training-free activation steering method that weights steering strength across network depth using a Gaussian schedule. On the MASK benchmark, which separates honesty from knowledge, we evaluate seven models spanning the LLaMA, Qwen, and Mistral families and find that Gaussian scheduling improves honesty over no-steering and single-layer baselines in six of seven models. Equal-budget ablations on LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct show the Gaussian schedule outperforms random, uniform, and box-filter depth allocations,…
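
To make the idea of a Gaussian depth schedule concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the parameters `center`, `sigma`, and `alpha`, the unit-normalized steering `direction`, and the additive update are all assumed, since the abstract does not give the exact formulation or say how the steering direction is obtained. The sketch also shifts each layer's activation independently, whereas in a real forward pass an edit at one layer propagates to later layers.

```python
import numpy as np

def gaussian_depth_weights(num_layers, center, sigma):
    """Per-layer weights that peak at `center` and decay with depth distance.

    `center` and `sigma` are hypothetical hyperparameters (in layer units).
    """
    layers = np.arange(num_layers)
    return np.exp(-0.5 * ((layers - center) / sigma) ** 2)

def steer_hidden_states(hidden_states, direction, weights, alpha):
    """Add a depth-weighted steering vector to each layer's activation.

    hidden_states: array of shape (num_layers, hidden_dim)
    direction:     array of shape (hidden_dim,), an assumed steering direction
    weights:       array of shape (num_layers,), e.g. from the function above
    alpha:         global steering strength (assumed scalar)
    """
    unit = direction / np.linalg.norm(direction)
    # Broadcast: (num_layers, 1) * (1, hidden_dim) -> (num_layers, hidden_dim)
    return hidden_states + alpha * weights[:, None] * unit[None, :]

# Usage: layers near the center receive the strongest push.
weights = gaussian_depth_weights(num_layers=8, center=4, sigma=1.5)
steered = steer_hidden_states(np.zeros((8, 16)), np.ones(16), weights, alpha=2.0)
```

A single-layer edit corresponds to the limit of a very small `sigma`, where nearly all of the weight falls on the center layer; widening `sigma` spreads the intervention over neighbouring layers.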

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education · Topic Modeling