Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs

Amr Hegazy; Mostafa Elhoushi; Amr Alanwar

arXiv:2505.20309·cs.CL·March 17, 2026

TL;DR

This paper introduces a lightweight, trainable inference-time controller for LLMs that dynamically modulates behavior using activation steering, significantly improving safety and control without fine-tuning the models.

Contribution

A novel, adaptive activation steering method with a lightweight controller that modulates LLM outputs during inference based on intermediate activations.

Findings

1. Significantly increased refusal rates on safety benchmarks
2. Outperforms existing activation steering methods
3. Effective across multiple LLM architectures

Abstract

Controlling undesirable Large Language Model (LLM) behaviors, such as generating unsafe content or failing to adhere to safety guidelines, often relies on costly fine-tuning. Activation steering provides an alternative for inference-time control, but existing methods typically lack fine-grained, adaptive mechanisms. We introduce a novel approach using a lightweight, trainable controller network integrated during inference. This controller network observes specific intermediate LLM activations and predicts both a global scaling factor and layer-specific weights. The predicted global scaling factor and layer-specific weights then dynamically modulate the intensity of a steering patch, derived from a pre-computed "refusal direction" vector, applied across the LLM's layers during generation. Trained on activations from both harmful and benign prompts, our controller learns to…
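The mechanism described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the two-layer ReLU controller, the parameter names (`W1`, `b1`, `W2`, `b2`), the function names, the toy dimensions, and the random stand-in data are all assumptions made for illustration. Only the overall shape comes from the abstract: a controller reads an intermediate activation, predicts one global scaling factor plus one weight per layer, and those values scale a fixed "refusal direction" vector that is added to each layer's hidden state.

```python
import numpy as np

def controller_forward(activation, params):
    """Hypothetical two-layer MLP controller (architecture is an assumption).

    Maps one observed activation vector of shape (d_model,) to a scalar
    global scale and a vector of per-layer weights of shape (n_layers,).
    """
    hidden = np.maximum(0.0, activation @ params["W1"] + params["b1"])  # ReLU
    out = hidden @ params["W2"] + params["b2"]  # shape: (1 + n_layers,)
    global_scale = float(out[0])
    layer_weights = out[1:]
    return global_scale, layer_weights

def apply_steering(layer_states, refusal_dir, global_scale, layer_weights):
    """Add global_scale * w_l * refusal_dir to the hidden state of each layer l."""
    return [h + global_scale * w * refusal_dir
            for h, w in zip(layer_states, layer_weights)]

# Toy dimensions; random values stand in for real activations and trained weights.
rng = np.random.default_rng(0)
d_model, d_hidden, n_layers = 8, 4, 3
params = {
    "W1": rng.normal(size=(d_model, d_hidden)),
    "b1": np.zeros(d_hidden),
    "W2": rng.normal(size=(d_hidden, 1 + n_layers)),
    "b2": np.zeros(1 + n_layers),
}
refusal_dir = rng.normal(size=d_model)
refusal_dir /= np.linalg.norm(refusal_dir)  # unit-norm steering direction

observed = rng.normal(size=d_model)  # the activation the controller observes
layer_states = [rng.normal(size=d_model) for _ in range(n_layers)]

scale, weights = controller_forward(observed, params)
steered = apply_steering(layer_states, refusal_dir, scale, weights)
```

In a real model the patch would be applied inside the forward pass, so each steered layer would feed the next one. The sketch treats the layer states as independent vectors only to keep the arithmetic visible.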

Taxonomy

Topics: Advanced Control Systems Optimization