Aligning Large Language Models with Human Preferences through Representation Engineering
Wenhao Liu, Xiaohua Wang, Muling Wu, Tianlong Li, Changze Lv, Zixuan Ling, Jianhao Zhu, Cenyuan Zhang, Xiaoqing Zheng, Xuanjing Huang

TL;DR
This paper introduces RAHF, a novel representation engineering approach that aligns large language models with human preferences by transforming internal representations, offering a more stable and versatile alternative to reinforcement learning from human feedback.
Contribution
The study presents RAHF, a new method for aligning LLMs with human preferences through representation manipulation, improving stability, efficiency, and flexibility over existing techniques.
Findings
RAHF effectively captures human preferences within model representations.
RAHF allows precise control and manipulation of model behavior.
Experiments show RAHF outperforms traditional RLHF in stability and versatility.
Abstract
Aligning large language models (LLMs) with human preferences is crucial for enhancing their utility in terms of helpfulness, truthfulness, safety, harmlessness, and interestingness. Existing methods for achieving this alignment often involve employing reinforcement learning from human feedback (RLHF) to fine-tune LLMs based on human labels assessing the relative quality of model responses. Nevertheless, RLHF is susceptible to instability during fine-tuning and presents challenges in implementation. Drawing inspiration from the emerging field of representation engineering (RepE), this study aims to identify relevant representations for high-level human preferences embedded in patterns of activity within an LLM, and achieve precise control of model behavior by transforming its representations. This novel approach, denoted as Representation Alignment from Human Feedback (RAHF), proves to…
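The general representation-engineering idea mentioned in the abstract (find a direction in activation space associated with a preference, then shift representations along it) can be illustrated with a toy sketch. This is not the RAHF procedure, whose details are not given in this excerpt. It swaps in a plain difference-of-means direction as an assumption, and the function names (`preference_direction`, `steer`), the scaling parameter `alpha`, and the 3-dimensional "activations" are all hypothetical, made up for illustration.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def preference_direction(preferred: List[Vector], dispreferred: List[Vector]) -> Vector:
    """Assumed stand-in technique: mean preferred activation minus mean dispreferred activation."""
    mean_pref = mean_vector(preferred)
    mean_disp = mean_vector(dispreferred)
    return [p - d for p, d in zip(mean_pref, mean_disp)]


def steer(hidden: Vector, direction: Vector, alpha: float = 1.0) -> Vector:
    """Shift a hidden state along the direction, scaled by alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations (invented numbers, not real model outputs).
preferred_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
dispreferred_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

direction = preference_direction(preferred_acts, dispreferred_acts)
steered = steer([0.0, 0.0, 0.0], direction, alpha=0.5)
print(direction, steered)
```

In a real model, the vectors would presumably be hidden states collected at a chosen layer for preferred and dispreferred responses. Which layer to use and how the shift is applied are design choices this sketch does not settle.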
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems
Methods: ALIGN
