Representation Tuning
Christopher M. Ackerman

TL;DR
This paper introduces representation tuning, a method that fine-tunes behavioral vectors directly into a large language model. It is presented as giving more effective and safer control over model honesty than online (inference-time) steering.
Contribution
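The work presents an approach to embedding behavioral control vectors into LLMs through fine-tuning with a dual loss: the cosine similarity of residual stream activations to the vectors, combined with a standard token-based loss. The stated aim is improved safety and control over honesty in generated outputs.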
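To make the dual loss concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the source only says that a cosine-similarity term on residual stream activations is combined with a standard token-based loss. The weighted-sum form, the `rep_weight` and `direction` parameters, and all function names below are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)  # epsilon avoids division by zero


def token_cross_entropy(logits, target_index):
    """Standard token-level cross-entropy for one position (stable log-softmax)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_index]


def dual_loss(activations, behavior_vector, logits, target_index,
              rep_weight=1.0, direction=1.0):
    """Hypothetical combination: token loss plus a weighted representation loss.

    direction=+1.0 pulls activations toward the behavior vector,
    direction=-1.0 pushes them away. Both knobs are assumptions.
    """
    rep_loss = 1.0 - direction * cosine_similarity(activations, behavior_vector)
    tok_loss = token_cross_entropy(logits, target_index)
    return tok_loss + rep_weight * rep_loss


# Toy example: activations aligned with the vector incur a lower loss
# than activations pointing the opposite way, for the same logits.
vector = [1.0, 0.0]
logits = [2.0, 0.5, -1.0]
aligned = dual_loss([3.0, 0.0], vector, logits, target_index=0)
opposed = dual_loss([-3.0, 0.0], vector, logits, target_index=0)
```

In a real fine-tuning run the activations would come from a chosen layer of the model and the loss would be averaged over tokens and batches. Which layers and token positions are used is not stated in this summary.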
Findings
Representation tuning outperforms online steering in honesty control.
Fine-tuning with a cosine-similarity loss term enhances model safety.
The approach generalizes better than standard token-based fine-tuning.
Abstract
Activation engineering is becoming increasingly popular as a means of online control of large language models (LLMs). In this work, we extend the idea of inference-time steering with vectors that represent a behavioral direction of interest to tuning those vectors directly into the model, obviating the need for online control. First, we identify activation vectors related to honesty in an open-source LLM (Llama-2-13b-chat). Next, we demonstrate that model output can be made more or less honest by adding positive or negative multiples of these vectors to residual stream activations during generation. Then, we show that a similar effect can be achieved by fine-tuning the vectors directly into the model, by use of a dual loss function based on the cosine similarity of residual stream activations to the vectors combined with a standard token-based loss ("representation tuning"). Finally, we…
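The inference-time steering step described in the abstract, adding positive or negative multiples of a behavioral vector to residual stream activations, can be sketched as follows. This is a toy illustration on plain Python lists: the function names, the example vector, and the multiplier values are all hypothetical, and a real setup would apply the addition inside the model's forward pass at selected layers.

```python
import math


def add_steering_vector(residual, behavior_vector, multiplier):
    """Return residual + multiplier * behavior_vector (element-wise)."""
    if len(residual) != len(behavior_vector):
        raise ValueError("residual and behavior_vector must have the same length")
    return [r + multiplier * v for r, v in zip(residual, behavior_vector)]


def projection(residual, behavior_vector):
    """Scalar projection of the residual onto the behavior direction."""
    norm = math.sqrt(sum(v * v for v in behavior_vector))
    return sum(r * v for r, v in zip(residual, behavior_vector)) / norm


# Hypothetical 3-dimensional residual activation and "honesty" direction.
residual = [0.5, -1.0, 2.0]
honesty_vector = [1.0, 0.0, 0.0]

# A positive multiple moves the activation along the direction,
# a negative multiple moves it the opposite way.
more_honest = add_steering_vector(residual, honesty_vector, 4.0)
less_honest = add_steering_vector(residual, honesty_vector, -4.0)
```

Representation tuning, as the abstract frames it, aims for a similar shift in activations without this per-generation intervention, by training the shift into the weights.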
Taxonomy
Topics: Topic Modeling · Spam and Phishing Detection · Explainable Artificial Intelligence (XAI)
