PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback

Souradip Chakraborty; Amrit Singh Bedi; Alec Koppel; Dinesh Manocha; Huazheng Wang; Mengdi Wang; and Furong Huang

arXiv:2308.02585·cs.LG·May 2, 2024·1 cites

Open Access

TL;DR

This paper introduces PARL, a novel bilevel optimization framework for policy alignment in reinforcement learning using human feedback, addressing distribution shift issues and improving sample efficiency.

Contribution

It formulates RLHF as a bilevel optimization problem, which it describes as the first formulation of its kind, and develops an algorithm, A-PARL, with proven sample complexity bounds for policy alignment.
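As a rough illustration of what a bilevel formulation of this kind can look like, here is a generic sketch. The notation is assumed for illustration and is not taken from the paper: \nu denotes the reward parameters, r_\nu the designed reward, \pi a policy, \rho(\tau; \pi) the trajectory distribution induced by \pi, U_\nu an upper-level utility on trajectories, and H, \gamma a horizon and discount factor.

```latex
% Generic bilevel sketch (assumed notation, not the paper's exact statement).
% Upper level: choose reward parameters \nu to maximize an alignment utility.
% Lower level: the policy is optimal for the reward designed by \nu.
\begin{aligned}
\max_{\nu}\quad
  & \mathbb{E}_{\tau \sim \rho(\tau;\, \pi^{*}_{\nu})}
    \big[\, U_{\nu}(\tau) \,\big] \\
\text{s.t.}\quad
  & \pi^{*}_{\nu} \in \arg\max_{\pi}\;
    \mathbb{E}_{\tau \sim \rho(\tau;\, \pi)}
    \Big[ \sum_{h=0}^{H-1} \gamma^{h}\, r_{\nu}(s_h, a_h) \Big]
\end{aligned}
```

In this sketch the upper-level expectation is taken over trajectories drawn from the lower-level optimal policy \pi^{*}_{\nu}, so the sampling distribution of the upper objective itself depends on the lower-level solution.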

Findings

01. Sample efficiency improvements of up to 63%

02. Addresses distribution shift in RLHF

03. Demonstrates effectiveness on DeepMind Control and Meta-World tasks

Abstract

We present a novel unified bilevel optimization-based framework, PARL, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning using utility or preference-based feedback. We identify a major gap within current algorithmic designs for solving policy alignment: they lack a precise characterization of how the alignment objective depends on the data generated by policy trajectories. This shortfall contributes to the sub-optimal performance observed in contemporary algorithms. Our framework addresses these concerns by explicitly parameterizing the distribution of the upper alignment objective (reward design) by the lower optimal variable (optimal policy for the designed reward). Interestingly, from an optimization perspective, our formulation leads to a new class of stochastic bilevel problems where the stochasticity at…

Taxonomy

Topics: Reinforcement Learning in Robotics
