Multi-Attribute Steering of Language Models via Targeted Intervention

Duy Nguyen; Archiki Prasad; Elias Stengel-Eskin; Mohit Bansal

arXiv:2502.12446 · cs.CL · July 10, 2025

TL;DR

This paper introduces MAT-Steer, a novel inference-time intervention framework that enables multi-attribute steering of large language models by learning sparse, orthogonal token-level intervention vectors, effectively balancing conflicting attributes.

Contribution

The paper proposes a scalable multi-attribute steering method that learns attribute-specific intervention vectors under orthogonality constraints, reducing conflicts between attributes during token-level intervention.

Findings

01 Outperforms existing ITI methods in multi-attribute settings

02 Achieves 3% average accuracy gain on QA tasks

03 Attains 55.82% win rate against best ITI baseline

Abstract

Inference-time intervention (ITI) has emerged as a promising method for steering large language model (LLM) behavior in a particular direction (e.g., improving helpfulness) by intervening on token representations without costly updates to the LLM's parameters. However, existing ITI approaches fail to scale to multi-attribute settings with conflicts, such as enhancing helpfulness while also reducing toxicity. To address this, we introduce Multi-Attribute Targeted Steering (MAT-Steer), a novel steering framework designed for selective token-level intervention across multiple attributes. MAT-Steer learns steering vectors using an alignment objective that shifts the model's internal representations of undesirable outputs closer to those of desirable ones while enforcing sparsity and orthogonality among vectors for different attributes, thereby reducing inter-attribute conflicts. We evaluate…
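
The abstract describes three ingredients: per-attribute steering vectors, a sparsity constraint, and an orthogonality constraint between vectors for different attributes. The NumPy sketch below is one plausible reading of those ingredients, not the authors' implementation. The function names (`orthogonality_penalty`, `sparsity_penalty`, `steer`), the use of squared cosine similarity, the L1 form of the sparsity term, and the per-attribute `gates` are all assumptions made for illustration.

```python
import numpy as np

def orthogonality_penalty(V):
    """Hypothetical penalty: sum of squared cosine similarities between
    every pair of distinct attribute vectors. V has shape (k, d)."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    U = V / np.clip(norms, 1e-12, None)   # unit-normalise each row
    G = U @ U.T                           # pairwise cosine similarities
    off_diag = G - np.diag(np.diag(G))    # drop self-similarity terms
    return 0.5 * np.sum(off_diag ** 2)    # count each pair once

def sparsity_penalty(gates):
    """Hypothetical L1 penalty on per-attribute intervention strengths."""
    return np.abs(gates).sum()

def steer(h, V, gates):
    """Shift one token representation h (shape (d,)) by a gated sum of
    attribute vectors: h + sum_k gates[k] * V[k]."""
    return h + gates @ V

# Two attribute vectors in a 3-dimensional toy representation space.
V = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0]])
gates = np.array([1.0, 0.5])
h = np.zeros(3)

steered = steer(h, V, gates)
penalty = orthogonality_penalty(V)  # 0.0 here, since the rows are orthogonal
```

When the vectors are orthogonal, as in the toy example, the penalty is zero and each attribute's shift leaves the others' directions untouched. A gate near zero means the corresponding attribute contributes almost nothing to that token, which is the intuition behind pairing the intervention with a sparsity term.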

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling