Aligning Large Language Models with Human Preferences through Representation Engineering

Wenhao Liu; Xiaohua Wang; Muling Wu; Tianlong Li; Changze Lv; Zixuan Ling; Jianhao Zhu; Cenyuan Zhang; Xiaoqing Zheng; Xuanjing Huang

arXiv:2312.15997·cs.CL·July 4, 2024·1 cites

TL;DR

This paper introduces RAHF, a novel representation engineering approach that aligns large language models with human preferences by transforming internal representations, offering a more stable and versatile alternative to reinforcement learning from human feedback.

Contribution

The study presents RAHF, a new method for aligning LLMs with human preferences through representation manipulation, improving stability, efficiency, and flexibility over existing techniques.

Findings

01. RAHF effectively captures human preferences within model representations.

02. RAHF allows precise control and manipulation of model behavior.

03. Experiments show RAHF outperforms traditional RLHF in stability and versatility.

Abstract

Aligning large language models (LLMs) with human preferences is crucial for enhancing their utility in terms of helpfulness, truthfulness, safety, harmlessness, and interestingness. Existing methods for achieving this alignment often involve employing reinforcement learning from human feedback (RLHF) to fine-tune LLMs based on human labels assessing the relative quality of model responses. Nevertheless, RLHF is susceptible to instability during fine-tuning and presents challenges in implementation. Drawing inspiration from the emerging field of representation engineering (RepE), this study aims to identify relevant representations for high-level human preferences embedded in patterns of activity within an LLM, and achieve precise control of model behavior by transforming its representations. This novel approach, denoted as Representation Alignment from Human Feedback (RAHF), proves to…
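
The abstract describes the idea only at a high level: find a representation of a human preference in the model's activations, then transform activations along it. As a rough illustration of that general representation-engineering pattern, and not of the RAHF procedure itself, the sketch below uses a simple difference-of-means direction on toy vectors. All names (`preference_direction`, `steer`, `alpha`), the toy data, and the choice of difference-of-means are assumptions made for illustration; none of them come from the paper.

```python
import random


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def preference_direction(preferred, dispreferred):
    """Difference of mean activations: a simple stand-in for a
    'preference representation' (an assumption, not RAHF's method)."""
    mp = mean_vector(preferred)
    md = mean_vector(dispreferred)
    return [p - d for p, d in zip(mp, md)]


def steer(hidden, direction, alpha=1.0):
    """Shift one hidden-state vector along the preference direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy "activations": preferred and dispreferred responses differ
# along the first coordinate, plus small Gaussian noise.
rng = random.Random(0)
dim = 8
true_axis = [1.0] + [0.0] * (dim - 1)


def sample(shift):
    return [shift * t + rng.gauss(0.0, 0.1) for t in true_axis]


preferred = [sample(+1.0) for _ in range(32)]
dispreferred = [sample(-1.0) for _ in range(32)]

direction = preference_direction(preferred, dispreferred)

# Take an activation that looks "dispreferred" and steer it.
hidden = sample(-1.0)
steered = steer(hidden, direction, alpha=1.0)

print("projection before:", dot(hidden, direction))
print("projection after: ", dot(steered, direction))
```

In a real model the vectors would be hidden states read from a chosen layer, and the shift would be applied during the forward pass or distilled into the weights; which layer, which scale, and how the direction is extracted are design choices this sketch does not settle.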

Code & Models

Repositories

liuamber/rahf
PyTorch · Official

Models

🤗
Liuwenhao2022/Mistral-7B-LoRA-RAHF-DUAL
model · 10 downloads · ♡ 1

Videos

Aligning Large Language Models with Human Preferences through Representation Engineering · Underline

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems

Methods: ALIGN