Value Augmented Sampling for Language Model Alignment and Personalization

Seungwook Han; Idan Shenfeld; Akash Srivastava; Yoon Kim; Pulkit Agrawal

arXiv:2405.06639 · cs.LG · May 13, 2024

TL;DR

This paper introduces Value Augmented Sampling (VAS), a reward optimization framework that aligns and personalizes LLMs by maximizing reward functions without modifying the model weights. It does so at lower inference cost than search-based methods such as Best-of-N and supports composing multiple rewards.

Contribution

VAS provides a stable, efficient method for reward optimization in LLMs that requires neither co-training the policy and the value function nor access to model weights, so it can adapt models that are available only through an API, such as ChatGPT.

Findings

1. VAS outperforms PPO and DPO on standard benchmarks.
2. Achieves results comparable to Best-of-128 with lower inference cost.
3. Enables reward composition and personalized control during deployment.
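
The reward composition finding can be illustrated with a small sketch. It assumes that each reward has its own value estimator and that estimates are combined by a user-chosen weighted sum at deployment time; the linear form, the function name `compose_values`, and the toy numbers are illustrative assumptions, not details taken from this page or from the paper's repository.

```python
def compose_values(value_tables, weights):
    """Combine per-reward value estimates into one score per candidate token.

    value_tables: {reward_name: {token: value_estimate}}
    weights:      {reward_name: user-chosen weight}
    Assumption: rewards are composed as a plain weighted sum.
    """
    if set(value_tables) != set(weights):
        raise ValueError("every reward needs exactly one weight")
    # Collect every token mentioned by any value table.
    tokens = set().union(*(table.keys() for table in value_tables.values()))
    return {
        tok: sum(weights[name] * table.get(tok, 0.0)
                 for name, table in value_tables.items())
        for tok in tokens
    }


# Toy value estimates for two hypothetical rewards over two candidate tokens.
value_tables = {
    "helpful": {"a": 1.0, "b": 0.2},
    "concise": {"a": 0.1, "b": 0.9},
}

# Two users with different preferences get different rankings
# from the same underlying value estimates.
helpful_first = compose_values(value_tables, {"helpful": 0.8, "concise": 0.2})
concise_first = compose_values(value_tables, {"helpful": 0.2, "concise": 0.8})
```

Under these assumptions only the weights change between users; the value estimates and the base model stay fixed, which is what makes per-user control at deployment cheap.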

Abstract

Aligning Large Language Models (LLMs) to cater to different human preferences, learn new skills, and unlearn harmful behavior is an important problem. Search-based methods, such as Best-of-N or Monte-Carlo Tree Search, are performant, but impractical for LLM adaptation due to their high inference cost. On the other hand, using Reinforcement Learning (RL) for adaptation is computationally efficient, but performs worse due to the optimization challenges in co-training the value function and the policy. We present a new framework for reward optimization, Value Augmented Sampling (VAS), that can maximize different reward functions using data sampled from only the initial, frozen LLM. VAS solves for the optimal reward-maximizing policy without co-training the policy and the value function, making the optimization stable, outperforming established baselines, such as PPO and DPO, on…
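
The truncated abstract does not spell out how value estimates enter decoding. A minimal sketch, assuming a common value-guided form in which each candidate token's log-probability under the frozen model is shifted by a scaled value estimate and then renormalized; the function name `value_augmented_probs`, the parameter `beta`, and the toy numbers are illustrative assumptions rather than the authors' implementation.

```python
import math


def value_augmented_probs(base_logprobs, values, beta=1.0):
    """Reweight a frozen model's next-token distribution by value estimates.

    Assumed form: p(token) is proportional to
    p_base(token) * exp(beta * value(token)).
    Tokens without a value estimate are treated as having value 0.
    """
    scores = {tok: lp + beta * values.get(tok, 0.0)
              for tok, lp in base_logprobs.items()}
    # Subtract the max score before exponentiating for numerical stability.
    top = max(scores.values())
    unnorm = {tok: math.exp(s - top) for tok, s in scores.items()}
    total = sum(unnorm.values())
    return {tok: u / total for tok, u in unnorm.items()}


# Toy next-token distribution from the frozen base model.
base = {"yes": math.log(0.6), "no": math.log(0.3), "maybe": math.log(0.1)}
# Toy value estimates: the reward prefers "no".
vals = {"yes": 0.0, "no": 1.0, "maybe": 0.5}

probs = value_augmented_probs(base, vals, beta=2.0)
unchanged = value_augmented_probs(base, vals, beta=0.0)
```

In this sketch the base model is only queried, never updated, and `beta` controls how far sampling moves away from the base distribution: with `beta=0.0` the base distribution is recovered, and larger values put more mass on high-value tokens.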

Code & Models

Repositories

idanshen/Value-Augmented-Sampling (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Direct Preference Optimization · Entropy Regularization · Proximal Policy Optimization · Monte-Carlo Tree Search