Extracting Latent Steering Vectors from Pretrained Language Models
Nishant Subramani, Nivedita Suresh, Matthew E. Peters

TL;DR
This paper shows that pretrained language models contain latent steering vectors that can be extracted and manipulated to control text generation without fine-tuning, enabling near-perfect reconstruction of target sentences, unsupervised sentiment transfer, and measurement of sentence similarity.
Contribution
The authors introduce a method to extract latent steering vectors directly from pretrained language model decoders and use them for controllable text generation, without fine-tuning the model.
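The extraction idea can be illustrated with a toy, self-contained sketch. It assumes a stand-in for the pretrained decoder: fixed per-position hidden states and a frozen linear readout to vocabulary logits. In this sketch a single vector z is added to every hidden state and optimized by gradient descent to minimize the negative log-likelihood of the target tokens, while the model parameters stay untouched. The names `readout`, `neg_log_likelihood` and `extract_steering_vector`, the toy dimensions, and the learning rate are hypothetical; the paper works with a real pretrained decoder, which this sketch does not model.

```python
import math
import random

# Toy stand-in for a frozen decoder (an assumption for illustration only):
# fixed per-position hidden states plus a frozen linear readout W (vocab x dim).

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def readout(W, h):
    # logits[v] = W[v] . h
    return [sum(w_i * h_i for w_i, h_i in zip(row, h)) for row in W]

def neg_log_likelihood(W, hidden, target, z):
    """Negative log-likelihood of the target tokens when z is added to every hidden state."""
    nll = 0.0
    for h, y in zip(hidden, target):
        steered = [h_i + z_i for h_i, z_i in zip(h, z)]
        nll -= log_softmax(readout(W, steered))[y]
    return nll

def extract_steering_vector(W, hidden, target, lr=0.02, steps=300):
    """Gradient descent on z only; W and hidden are never modified."""
    d = len(hidden[0])
    z = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for h, y in zip(hidden, target):
            steered = [h_i + z_i for h_i, z_i in zip(h, z)]
            probs = [math.exp(lp) for lp in log_softmax(readout(W, steered))]
            # d(-log p_y)/dz = sum_v (p_v - 1[v == y]) * W[v]
            for v, row in enumerate(W):
                coeff = probs[v] - (1.0 if v == y else 0.0)
                for i in range(d):
                    grad[i] += coeff * row[i]
        z = [z_i - lr * g_i for z_i, g_i in zip(z, grad)]
    return z

# Usage with random toy data: a 6-token vocabulary, 4-dim hidden states, 3-token target.
random.seed(0)
W = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
hidden = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
target = [2, 5, 1]
z = extract_steering_vector(W, hidden, target)
print(neg_log_likelihood(W, hidden, target, [0.0] * 4), neg_log_likelihood(W, hidden, target, z))
```

Because this linear toy makes the loss convex in z, a small step size reliably lowers it. With a real transformer the optimization is non-convex, and whether the decoded text matches the target has to be checked empirically (the paper reports > 99 BLEU).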
Findings
Steering vectors recover target sentences nearly perfectly (>99 BLEU) for English sentences from a variety of domains.
Vector arithmetic on steering vectors enables unsupervised sentiment transfer on the Yelp sentiment benchmark, with performance comparable to models tailored to this task.
Distances between steering vectors correlate with sentence similarity, and steering vectors outperform pooled hidden states of the model as a similarity measure.
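The similarity finding can be checked with a simple evaluation loop: score each sentence pair from its two steering vectors, then measure rank correlation against human similarity judgments. This sketch assumes cosine similarity as the measure and Spearman's rank correlation as the evaluation statistic (a common choice for textual similarity benchmarks). The summary does not state which distance or statistic the paper uses, and the function names and example vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two steering vectors (higher = more similar)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical steering vectors for three sentence pairs, with gold scores on a 0-5 scale.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [1.0, 1.0]), ([1.0, 0.0], [0.0, 1.0])]
gold = [5.0, 3.0, 1.0]
predicted = [cosine_similarity(u, v) for u, v in pairs]
print(spearman(predicted, gold))
```

Ranking pairs by cosine distance (1 minus cosine similarity) instead of similarity only flips the sign of the correlation.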
Abstract
Prior work on controllable text generation has focused on learning how to control language models through trainable decoding, smart-prompt design, or fine-tuning based on a desired objective. We hypothesize that the information needed to steer the model to generate a target sentence is already encoded within the model. Accordingly, we explore a different approach altogether: extracting latent vectors directly from pretrained language model decoders without fine-tuning. Experiments show that there exist steering vectors, which, when added to the hidden states of the language model, generate a target sentence nearly perfectly (> 99 BLEU) for English sentences from a variety of domains. We show that vector arithmetic can be used for unsupervised sentiment transfer on the Yelp sentiment benchmark, with performance comparable to models tailored to this task. We find that distances between…
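The sentiment-transfer arithmetic can be sketched as shifting a sentence's steering vector along the direction between the two sentiment classes. This sketch assumes the offset is the difference between the mean steering vectors of target-class and source-class sentences, scaled by a hypothetical factor `lam`. The abstract does not spell out the exact formula, and decoding the shifted vector back into text with the frozen model is not shown.

```python
def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def transfer(z_source, source_class, target_class, lam=1.0):
    """Return z_source + lam * (mean(target_class) - mean(source_class))."""
    mu_src = mean_vector(source_class)
    mu_tgt = mean_vector(target_class)
    return [z + lam * (t - s) for z, s, t in zip(z_source, mu_src, mu_tgt)]

# Hypothetical 2-dim steering vectors for negative and positive training sentences.
negative = [[1.0, 0.0], [3.0, 2.0]]
positive = [[0.0, 4.0], [2.0, 6.0]]
z_review = [2.0, 1.0]  # steering vector of a negative sentence to rewrite
print(transfer(z_review, negative, positive))           # [1.0, 5.0]
print(transfer(z_review, negative, positive, lam=0.5))  # [1.5, 3.0]
```

In this sketch `lam` controls how far the vector moves toward the target class; `lam=0` returns the original vector unchanged.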
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Sentiment Analysis and Opinion Mining
