
TL;DR
This paper offers a theoretical account of why Reinforcement Learning from AI Feedback (RLAIF) improves language model alignment: pretraining encodes latent human values in model representations, and constitutional prompts activate those values.
Contribution
It proposes the latent value hypothesis and formalizes it with a linear model, providing a unified explanation for RLAIF's effectiveness and limitations based on representation encoding of values.
Findings
RLAIF improves alignment when constitutional prompts activate value directions better than default generation.
The maximum quality of RLAIF depends on how well representations encode human values, which scales with model size.
Adversarial constitutions can activate harmful or anti-social value directions from pretraining data.
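The first and third findings can be illustrated with a toy numerical sketch. This is not the paper's formalization: it assumes values, default generation, and constitution-activated behavior are each a single direction in a small vector space, and it uses cosine similarity as the measure of correlation with true values. All names and vectors here (true_value, default_dir, helpful_constitution, adversarial_constitution, rlaif_helps) are hypothetical and chosen only for illustration.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two direction vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def rlaif_helps(true_value, default_dir, constitution_dir):
    """Toy criterion (an assumption, not the paper's exact condition):
    RLAIF is expected to help when the constitution-activated direction
    aligns with the true value direction better than the default
    generation direction does."""
    return cosine(constitution_dir, true_value) > cosine(default_dir, true_value)


# Hypothetical 3-dimensional representation space.
true_value = np.array([1.0, 0.0, 0.0])                 # assumed "true values" direction
default_dir = np.array([0.3, 0.9, 0.3])                # weakly aligned default generation
helpful_constitution = np.array([0.9, 0.3, 0.1])       # strongly aligned with true values
adversarial_constitution = np.array([-0.8, 0.5, 0.2])  # points away from true values

helps = rlaif_helps(true_value, default_dir, helpful_constitution)
harms = rlaif_helps(true_value, default_dir, adversarial_constitution)

print("helpful constitution improves on default:", helps)
print("adversarial constitution improves on default:", harms)
```

In this toy setup the helpful constitution beats the default direction, while the adversarial one has negative alignment with the assumed true value direction, mirroring the claim that adversarial constitutions can activate harmful directions.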
Abstract
Reinforcement Learning from AI Feedback (RLAIF) enables language models to improve by training on their own preference judgments, yet no theoretical account explains why this self-improvement seemingly works for value learning. We propose the latent value hypothesis, that pretraining on internet-scale data encodes human values as directions in representation space, and constitutional prompts elicit these latent values into preference judgments. We formalize this intuition under a linear model where the constitution acts as a projection operator selecting value-relevant directions. Our analysis yields several results. RLAIF improves alignment when the constitution-activated direction correlates with true values better than the model's default generation direction, thus explaining the generation-judgment gap; the ceiling on RLAIF quality is determined by how well representations encode…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Decision-Making and Behavioral Economics
