Partial Policy Gradients for RL in LLMs
Puneet Mathur, Branislav Kveton, Subhojyoti Mukherjee, Viet Dac Lai

TL;DR
This paper proposes policy gradients that optimize for a subset of future rewards rather than all of them. Smaller subsets correspond to simpler policies whose empirical gradient estimates are more accurate, which lets policy classes such as full planning, greedy, K-step lookahead, and segment policies be modeled and compared in one framework.
Contribution
The paper models policy structure in policy gradients by restricting the objective to a subset of future rewards, so that different policy classes can be compared directly on LLM conversational tasks.
Findings
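Different policies perform best on different persona-alignment conversational problems.
Smaller reward subsets yield more accurate empirical gradient estimates, so the simpler policies they represent are learned more reliably.
The approach covers full planning, greedy, K-step lookahead, and segment policies.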
Abstract
Reinforcement learning is a framework for learning to act sequentially in an unknown environment. We propose a natural approach for modeling policy structure in policy gradients. The key idea is to optimize for a subset of future rewards: smaller subsets represent simpler policies, which can be learned more reliably because their empirical gradient estimates are more accurate. Our approach allows for modeling and comparison of different policy classes, including full planning, greedy, K-step lookahead, and segment policies. We evaluate the policies empirically on multiple persona-alignment conversational problems. Different policies excel in different problems, reflecting their different characteristics and highlighting the importance of our studied policy class.
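The idea of optimizing for a subset of future rewards can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `partial_returns` and `surrogate_loss`, the undiscounted sum, and the exact windowing rule (each step is credited with its own reward plus at most `lookahead` later rewards) are assumptions made for illustration. Under these assumptions, `lookahead=None` plays the role of full planning and `lookahead=0` the role of a greedy policy. Segment policies are not sketched, because the abstract does not define them.

```python
import numpy as np

def partial_returns(rewards, lookahead=None):
    """Per-step weights built from a subset of future rewards (assumed form).

    lookahead=None: sum of all rewards from step t onward (full planning).
    lookahead=0:    immediate reward only (greedy).
    lookahead=K:    reward at step t plus the next K rewards (K-step lookahead).
    """
    rewards = np.asarray(rewards, dtype=float)
    T = len(rewards)
    # Prefix sums: csum[i] == rewards[:i].sum()
    csum = np.concatenate(([0.0], np.cumsum(rewards)))
    starts = np.arange(T)
    if lookahead is None:
        ends = np.full(T, T)
    else:
        ends = np.minimum(starts + lookahead + 1, T)
    return csum[ends] - csum[starts]

def surrogate_loss(log_probs, weights):
    """REINFORCE-style surrogate: its gradient with respect to the policy
    parameters is the negative of sum_t weights[t] * grad log pi(a_t | s_t).
    Here log_probs is a plain array, so this only computes the scalar value."""
    return -float(np.sum(np.asarray(log_probs) * np.asarray(weights)))

# Hypothetical per-turn rewards for a four-turn conversation.
rewards = [1.0, 0.0, 2.0, 3.0]
full = partial_returns(rewards)         # [6., 5., 5., 3.]
greedy = partial_returns(rewards, 0)    # [1., 0., 2., 3.]
one_step = partial_returns(rewards, 1)  # [1., 2., 5., 3.]
```

Smaller windows sum fewer random rewards per step, which is one way to read the abstract's claim that simpler policies have more accurate empirical gradient estimates. The price is that rewards outside the window no longer influence the update.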
Taxonomy
Topics: Persona Design and Applications · Machine Learning in Healthcare · Recommender Systems and Techniques
