Multi-Attribute Steering of Language Models via Targeted Intervention
Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal

TL;DR
This paper introduces MAT-Steer, a novel inference-time intervention framework that enables multi-attribute steering of large language models by learning sparse, orthogonal token-level intervention vectors, effectively balancing conflicting attributes.
Contribution
The paper proposes a scalable multi-attribute steering method that learns one intervention vector per attribute under orthogonality constraints, reducing conflicts between attributes during token-level intervention.
Findings
Outperforms existing ITI methods in multi-attribute settings
Achieves 3% average accuracy gain on QA tasks
Attains 55.82% win rate against best ITI baseline
Abstract
Inference-time intervention (ITI) has emerged as a promising method for steering large language model (LLM) behavior in a particular direction (e.g., improving helpfulness) by intervening on token representations without costly updates to the LLM's parameters. However, existing ITI approaches fail to scale to multi-attribute settings with conflicts, such as enhancing helpfulness while also reducing toxicity. To address this, we introduce Multi-Attribute Targeted Steering (MAT-Steer), a novel steering framework designed for selective token-level intervention across multiple attributes. MAT-Steer learns steering vectors using an alignment objective that shifts the model's internal representations of undesirable outputs closer to those of desirable ones while enforcing sparsity and orthogonality among vectors for different attributes, thereby reducing inter-attribute conflicts. We evaluate…
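The mechanism the abstract describes can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify the form of the token-level selection or of the regularizers, so the sigmoid gate, the squared-dot-product orthogonality penalty, the L1 sparsity penalty, and all function names below (`steer_token`, `orthogonality_penalty`, `sparsity_penalty`) are assumptions made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def steer_token(h, vectors, gates):
    """Shift one token representation h by gated, attribute-specific vectors.

    Assumed form: h' = h + sum_k g_k(h) * theta_k, where the gate
    g_k(h) = sigmoid(w_k . h + b_k) decides how strongly attribute k
    intervenes on this particular token.
    """
    out = list(h)
    for theta, (w, b) in zip(vectors, gates):
        g = sigmoid(dot(w, h) + b)
        for i, t in enumerate(theta):
            out[i] += g * t
    return out


def orthogonality_penalty(vectors):
    """Sum of squared dot products over all pairs of steering vectors.

    The value is zero exactly when the vectors are mutually orthogonal.
    """
    total = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            total += dot(vectors[i], vectors[j]) ** 2
    return total


def sparsity_penalty(vectors):
    """L1 norm of all steering-vector entries (an assumed sparsity term)."""
    return sum(abs(x) for theta in vectors for x in theta)


# Hypothetical toy setup: two attributes in a 3-dimensional hidden space.
vectors = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
gates = [([0.0, 0.0, 0.0], 0.0), ([0.0, 0.0, 0.0], 0.0)]  # both gates = 0.5
h = [1.0, 1.0, 1.0]

steered = steer_token(h, vectors, gates)
print(steered)
print(orthogonality_penalty(vectors), sparsity_penalty(vectors))
```

In this toy setup the two vectors are orthogonal, so each attribute shifts a different coordinate and the orthogonality penalty is zero. A gate with a strongly negative bias leaves a token essentially unchanged, which is one way to realize selective intervention.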
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
