Representation Tuning

Christopher M. Ackerman

arXiv:2409.06927·cs.LG·November 26, 2024

TL;DR

This paper introduces representation tuning, a method that fine-tunes behavioral vectors directly into a large language model during training, giving more effective and safer control over model honesty than online steering.

Contribution

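The work presents a novel approach to embedding behavioral control vectors into LLMs: fine-tuning with a dual loss that combines the cosine similarity of residual stream activations to the vectors with a standard token-based loss. The aim is safer, more controllable honesty in generated outputs.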

Findings

01. Representation tuning outperforms online steering in honesty control.

02. Fine-tuning with cosine similarity enhances model safety.

03. The approach generalizes better than standard token-based fine-tuning.

Abstract

Activation engineering is becoming increasingly popular as a means of online control of large language models (LLMs). In this work, we extend the idea of inference-time steering with vectors that represent a behavioral direction of interest to tuning those vectors directly into the model, obviating the need for online control. First, we identify activation vectors related to honesty in an open-source LLM (Llama-2-13b-chat). Next, we demonstrate that model output can be made more or less honest by adding positive or negative multiples of these vectors to residual stream activations during generation. Then, we show that a similar effect can be achieved by fine-tuning the vectors directly into the model, by use of a dual loss function based on the cosine similarity of residual stream activations to the vectors combined with a standard token-based loss ("representation tuning"). Finally, we…
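The two mechanisms the abstract describes, online steering and the dual loss, can be illustrated with a toy sketch. This is not the paper's implementation: the function names, the `weight` parameter, the `1 - cos` / `1 + cos` form of the representation term, and the three-dimensional toy vectors are all assumptions made for illustration. In the actual work these operations apply to residual stream activations of Llama-2-13b-chat, and the exact loss formulation may differ.

```python
import math


def dot(a, b):
    """Dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def steer(activation, vector, multiplier):
    """Online steering: add a multiple of the behavioral vector to one
    residual stream activation. A positive multiplier pushes toward the
    behavior, a negative one pushes away from it."""
    return [h + multiplier * v for h, v in zip(activation, vector)]


def token_loss(logits, target_index):
    """Standard token-based loss: cross-entropy for one next-token
    prediction, computed with a numerically stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def dual_loss(activation, vector, logits, target_index, weight=1.0, toward=True):
    """Hypothetical dual loss: token loss plus a weighted cosine term.
    With toward=True the cosine term is smallest when the activation
    aligns with the vector; with toward=False it is smallest when the
    activation points the opposite way. The weighting is an assumption."""
    cos = cosine_similarity(activation, vector)
    representation_term = (1.0 - cos) if toward else (1.0 + cos)
    return token_loss(logits, target_index) + weight * representation_term


# Toy stand-ins (assumed values, not taken from the paper).
honesty_vector = [1.0, 0.0, 0.0]
activation = [0.2, 0.5, -0.1]
logits = [2.0, 0.5, -1.0]

more_honest = steer(activation, honesty_vector, 4.0)
less_honest = steer(activation, honesty_vector, -4.0)

loss_aligned = dual_loss(more_honest, honesty_vector, logits, 0)
loss_opposed = dual_loss(less_honest, honesty_vector, logits, 0)
```

In this sketch, steering changes activations only at generation time, whereas minimizing `dual_loss` during fine-tuning would change the weights so that activations align with the vector without any intervention at inference. That is the distinction the abstract draws between online control and representation tuning.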

Code & Models

Repositories

cma1114/representation_tuning (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Spam and Phishing Detection · Explainable Artificial Intelligence (XAI)