Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs
Amr Hegazy, Mostafa Elhoushi, Amr Alanwar

TL;DR
This paper introduces a lightweight, trainable inference-time controller for LLMs that dynamically modulates behavior using activation steering, significantly improving safety and control without fine-tuning the models.
Contribution
A novel, adaptive activation steering method with a lightweight controller that modulates LLM outputs during inference based on intermediate activations.
Findings
Significantly increased refusal rates on safety benchmarks
Outperforms existing activation steering methods
Effective across multiple LLM architectures
Abstract
Controlling undesirable Large Language Model (LLM) behaviors, such as the generation of unsafe content or failing to adhere to safety guidelines, often relies on costly fine-tuning. Activation steering provides an alternative for inference-time control, but existing methods typically lack fine-grained, adaptive mechanisms. We introduce a novel approach using a lightweight, trainable controller network integrated during inference. This controller network observes specific intermediate LLM activations and predicts both a global scaling factor and layer-specific weights. The predicted global scaling factor and layer-specific weights then dynamically modulate the intensity of a steering patch, derived from a pre-computed "refusal direction" vector, applied across the LLM's layers during generation. Trained on activations from both harmful and benign prompts, our controller learns to…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Control Systems Optimization
