Beyond Linear Steering: Unified Multi-Attribute Control for Language Models
Narmeen Oozeer, Luke Marks, Shreyans Jain, Fazl Barez, Amirali Abdullah

TL;DR
K-Steering is a novel method for multi-attribute control of large language models that uses a non-linear classifier and gradient-based interventions, outperforming linear methods and enabling dynamic behavior composition.
Contribution
Introduces K-Steering, a unified non-linear approach to multi-attribute control in LLMs, together with two new benchmarks, ToneBank and DebateMix, and reports stronger empirical performance than linear steering methods.
Findings
K-Steering outperforms linear methods in multi-attribute control.
It enables dynamic composition of behaviors without retraining.
Results are validated across 3 model families, using both activation-based classifiers and LLM-based judges.
Abstract
Controlling multiple behavioral attributes in large language models (LLMs) at inference time is a challenging problem due to interference between attributes and the limitations of linear steering methods, which assume additive behavior in activation space and require per-attribute tuning. We introduce K-Steering, a unified and flexible approach that trains a single non-linear multi-label classifier on hidden activations and computes intervention directions via gradients at inference time. This avoids linearity assumptions, removes the need for storing and tuning separate attribute vectors, and allows dynamic composition of behaviors without retraining. To evaluate our method, we propose two new benchmarks, ToneBank and DebateMix, targeting compositional behavioral control. Empirical results across 3 model families, validated by both activation-based classifiers and LLM-based judges,…
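The gradient-based intervention described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-layer tanh classifier, the loss (raise the mean logit of the target attributes, lower the mean logit of the avoided ones), the step size `alpha`, the number of steps, and all function names are assumptions made for illustration. The classifier weights here are random, whereas in practice they would come from training on hidden activations.

```python
import numpy as np

def classifier_logits(h, params):
    """Hypothetical two-layer MLP: one logit per attribute (multi-label)."""
    W1, b1, W2, b2 = params
    return np.tanh(h @ W1 + b1) @ W2 + b2

def _attribute_weights(num_attrs, target, avoid):
    # Assumed loss weights: negative on target attributes, positive on avoided ones.
    w = np.zeros(num_attrs)
    if len(target):
        w[list(target)] = -1.0 / len(target)
    if len(avoid):
        w[list(avoid)] = 1.0 / len(avoid)
    return w

def steering_loss(h, params, target, avoid=()):
    """Assumed loss: -(mean target logit) + (mean avoided logit)."""
    logits = classifier_logits(h, params)
    return float(_attribute_weights(logits.shape[0], target, avoid) @ logits)

def steering_grad(h, params, target, avoid=()):
    """Analytic gradient of steering_loss with respect to the activation h."""
    W1, b1, W2, b2 = params
    z = np.tanh(h @ W1 + b1)
    w = _attribute_weights(W2.shape[1], target, avoid)
    # Chain rule: d loss/d z = W2 @ w, d z/d pre = 1 - z^2, d pre/d h = W1.
    return W1 @ ((W2 @ w) * (1.0 - z ** 2))

def steer(h, params, target, avoid=(), alpha=0.05, steps=1):
    """Move the activation down the loss gradient; no retraining involved."""
    h = h.copy()
    for _ in range(steps):
        h -= alpha * steering_grad(h, params, target, avoid)
    return h

# Toy setup: random weights stand in for a trained classifier (assumption).
rng = np.random.default_rng(0)
d, m, k = 16, 32, 4  # activation dim, hidden width, number of attributes
params = (
    rng.normal(scale=d ** -0.5, size=(d, m)), np.zeros(m),
    rng.normal(scale=m ** -0.5, size=(m, k)), np.zeros(k),
)
h = rng.normal(size=d)

# Attribute sets are chosen at call time, so behaviors compose dynamically.
steered = steer(h, params, target=[0, 2], avoid=[3], alpha=0.05, steps=3)
```

Because the target and avoided attribute sets are plain arguments to `steer`, a different combination can be requested on each call against the same classifier, which is the sense in which no per-attribute vectors need to be stored or tuned in this sketch.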
