Efficient Preference-based Reinforcement Learning via Aligned Experience Estimation

Fengshuo Bai; Rui Zhao; Hongming Zhang; Sijia Cui; Ying Wen; Yaodong Yang; Bo Xu; Lei Han

arXiv:2405.18688·cs.LG·May 30, 2024

Open Access

TL;DR

SEER is an efficient preference-based reinforcement learning method that reduces human feedback needs by integrating label smoothing and conservative Q-estimation, leading to improved performance and sample efficiency.

Contribution

The paper introduces SEER, a novel PbRL approach that enhances feedback efficiency through label smoothing and conservative Q-estimation, outperforming existing methods.

Findings

01 SEER outperforms state-of-the-art PbRL methods in complex tasks.

02 SEER achieves more accurate Q-function estimates.

03 SEER requires less human feedback for effective learning.

Abstract

Preference-based reinforcement learning (PbRL) has shown impressive capabilities in training agents without reward engineering. However, a notable limitation of PbRL is its dependency on substantial human feedback. This dependency stems from the learning loop, which entails accurate reward learning compounded with value/policy learning, necessitating a considerable number of samples. To boost the learning loop, we propose SEER, an efficient PbRL method that integrates label smoothing and policy regularization techniques. Label smoothing reduces overfitting of the reward model by smoothing human preference labels. Additionally, we bootstrap a conservative estimate $Q$ using well-supported state-action pairs from the current replay memory to mitigate overestimation bias and utilize it for policy learning regularization. Our experimental results across a variety of complex tasks,…
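
To make the label-smoothing idea in the abstract concrete, here is a minimal sketch of a smoothed preference loss for a reward model. It is an illustration under stated assumptions, not the authors' implementation: the Bradley-Terry preference model is the common choice in PbRL but is not named in the text above, and the function names (`smooth_label`, `preference_prob`, `preference_loss`) and the smoothing strength `eps=0.1` are hypothetical.

```python
import math


def smooth_label(y, eps=0.1):
    """Pull a hard binary preference label y in {0, 1} toward 0.5.

    Standard two-class label smoothing: (1 - eps) * y + eps / 2.
    The value of eps is an illustrative assumption.
    """
    return (1.0 - eps) * y + eps * 0.5


def preference_prob(return_a, return_b):
    """Probability that segment A is preferred over segment B.

    Assumes a Bradley-Terry model over the summed predicted rewards,
    i.e. a sigmoid of the return difference (computed stably).
    """
    d = return_a - return_b
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def preference_loss(rewards_a, rewards_b, y, eps=0.1):
    """Cross-entropy between the smoothed label and the predicted preference.

    rewards_a, rewards_b: per-step predicted rewards for the two segments.
    y: 1.0 if the human preferred segment A, 0.0 if segment B.
    """
    p = preference_prob(sum(rewards_a), sum(rewards_b))
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
    t = smooth_label(y, eps)
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))


if __name__ == "__main__":
    seg_a = [2.0, 3.0, 5.0]  # predicted rewards, segment A
    seg_b = [0.0, 0.0, 0.0]  # predicted rewards, segment B
    print("hard label loss:    ", preference_loss(seg_a, seg_b, 1.0, eps=0.0))
    print("smoothed label loss:", preference_loss(seg_a, seg_b, 1.0, eps=0.1))
```

With a hard label, the loss keeps shrinking as the reward model widens the gap between the two segments, which rewards extreme predictions. With a smoothed target, an overconfident prediction is penalized even when it agrees with the human label, which is one way to read the abstract's claim that smoothing reduces overfitting of the reward model.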

