Word Embeddings Are Steers for Language Models
Chi Han, Jialiang Xu, Manling Li, Yi Fung, Chenkai Sun, Nan Jiang, Tarek Abdelzaher, Heng Ji

TL;DR
This paper introduces LM-Steers, a method to control language model generation styles through linear transformations of output word embeddings, demonstrating interpretability, transferability, and effectiveness in style control tasks.
Contribution
The work reveals that linear transformations of output word embeddings can steer language model styles, providing a new interpretable and transferable control mechanism.
Findings
LM-Steers exist in language models of all sizes.
Learning an LM-Steer for one style requires parameters equal to only 0.2% of the original model's size.
LM-Steers achieve competitive results in style control tasks.
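The 0.2% figure can be sanity-checked with a little arithmetic. The sketch below assumes that a steer is a single square matrix over the embedding dimension, and it plugs in the publicly reported size of GPT-2 Large (hidden size 1280, roughly 774M parameters) purely as an illustrative model; neither the matrix shape nor the choice of model is stated in this summary.

```python
def steer_param_fraction(hidden_dim, total_params):
    """Return (steer parameter count, fraction of the base model).

    Assumption: one steer is a single hidden_dim x hidden_dim matrix.
    """
    steer_params = hidden_dim * hidden_dim
    return steer_params, steer_params / total_params


# Illustrative numbers only: GPT-2 Large (hidden size 1280, ~774M parameters).
steer_params, fraction = steer_param_fraction(hidden_dim=1280, total_params=774_000_000)
print(f"{steer_params:,} steer parameters, {fraction:.2%} of the model")
```

Under these assumptions the steer comes to about 1.6M parameters, the same order of magnitude as the 0.2% reported above.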
Abstract
Language models (LMs) automatically learn word embeddings during pre-training on language corpora. Although word embeddings are usually interpreted as feature vectors for individual words, their roles in language model generation remain underexplored. In this work, we theoretically and empirically revisit output word embeddings and find that their linear transformations are equivalent to steering language model generation styles. We name such steers LM-Steers and find them existing in LMs of all sizes. It requires learning parameters equal to 0.2% of the original LMs' size for steering each style. On tasks such as language model detoxification and sentiment control, LM-Steers can achieve comparable or superior performance compared with state-of-the-art controlled generation methods while maintaining a better balance with generation quality. The learned LM-Steer serves as a lens in text…
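To make the idea of steering through output embeddings concrete, here is a minimal toy sketch in plain Python. It assumes one simple linear form, e' = e + eps * W e, applied to every output embedding before the usual dot-product-and-softmax step. The three-word vocabulary, the 2-dimensional embeddings, the matrix `W`, and the strength `eps` are all hypothetical values chosen for illustration and are not taken from the paper.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def steer_embeddings(E, W, eps):
    """Apply the assumed linear steer e' = e + eps * W e to each row of E."""
    return [[e_i + eps * we_i for e_i, we_i in zip(e, matvec(W, e))] for e in E]


def next_token_probs(h, E):
    """Softmax over dot products of context vector h with each output embedding."""
    return softmax([sum(h_i * e_i for h_i, e_i in zip(h, e)) for e in E])


# Hypothetical toy setup: dimension 0 loosely encodes sentiment.
vocab = ["good", "bad", "the"]
E = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]  # output word embeddings
W = [[1.0, 0.0], [0.0, 0.0]]               # toy steer matrix: amplify dimension 0
h = [0.5, 0.5]                             # context vector from the frozen LM

base = next_token_probs(h, E)
steered = next_token_probs(h, steer_embeddings(E, W, eps=2.0))

for word, p_base, p_steered in zip(vocab, base, steered):
    print(f"{word:>4}: base={p_base:.3f} steered={p_steered:.3f}")
```

In this toy setup the steer raises the probability of "good" and lowers that of "bad" without touching the context vector `h`, which is the sense in which only the output embeddings are being transformed. Setting `eps` to 0 recovers the unsteered distribution.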
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Natural Language Processing Techniques
Methods: Balanced Selection
