In-Dataset Trajectory Return Regularization for Offline Preference-based Reinforcement Learning

Songjun Tu; Jingbo Sun; Qichao Zhang; Yaocheng Zhang; Jia Liu; Ke Chen; Dongbin Zhao

arXiv:2412.09104·cs.AI·December 24, 2024

TL;DR

This paper introduces In-Dataset Trajectory Return Regularization (DTR), a method that improves offline preference-based reinforcement learning by reducing reward bias and enhancing policy learning through sequence modeling and ensemble normalization.

Contribution

The paper proposes DTR, a novel regularization technique using sequence modeling and ensemble normalization to address reward bias in offline PbRL, improving policy performance.

Findings

01. DTR outperforms state-of-the-art baselines on multiple benchmarks.

02. Ensemble normalization effectively balances reward differentiation and accuracy.

03. Sequence modeling mitigates reward bias and improves trajectory stitching.

Abstract

Offline preference-based reinforcement learning (PbRL) typically operates in two phases: first, use human preferences to learn a reward model and annotate rewards for a reward-free offline dataset; second, learn a policy by optimizing the learned reward via offline RL. However, accurately modeling step-wise rewards from trajectory-level preference feedback presents inherent challenges. The reward bias introduced, particularly the overestimation of predicted rewards, leads to optimistic trajectory stitching, which undermines the pessimism mechanism critical to the offline RL phase. To address this challenge, we propose In-Dataset Trajectory Return Regularization (DTR) for offline PbRL, which leverages conditional sequence modeling to mitigate the risk of learning inaccurate trajectory stitching under reward bias. Specifically, DTR employs Decision Transformer and TD-Learning to strike a…
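
The two-phase pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a Bradley-Terry preference model (the usual choice in PbRL, though the abstract does not name one), and it computes an undiscounted return-to-go over the rewards predicted for an in-dataset trajectory, which is the quantity a Decision Transformer is typically conditioned on. The function names `segment_return`, `preference_loss`, and `returns_to_go` are hypothetical.

```python
import math


def segment_return(rewards):
    """Sum of step-wise predicted rewards over one trajectory segment."""
    return sum(rewards)


def preference_loss(rewards_a, rewards_b, label):
    """Cross-entropy loss for one labeled pair under an assumed Bradley-Terry model.

    label is 1.0 if segment A is preferred, 0.0 if segment B is preferred.
    """
    diff = segment_return(rewards_a) - segment_return(rewards_b)
    # Bradley-Terry: P(A preferred) = sigmoid(return(A) - return(B)).
    p_a = 1.0 / (1.0 + math.exp(-diff))
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p_a + eps)
             + (1.0 - label) * math.log(1.0 - p_a + eps))


def returns_to_go(rewards):
    """Undiscounted return-to-go: rtg[t] = sum of rewards from step t onward."""
    rtg = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return rtg


if __name__ == "__main__":
    # Phase 1 (sketch): score a labeled pair of segments.
    print(preference_loss([1.0, 1.0], [0.0, 0.0], label=1.0))
    # Phase 2 (sketch): annotate a trajectory with predicted rewards,
    # then derive the return-to-go used for conditional sequence modeling.
    print(returns_to_go([1.0, 2.0, 3.0]))
```

Because the preference label only constrains the difference in summed rewards between two segments, many step-wise reward assignments fit the same labels equally well, which is one way to see why step-wise rewards learned from trajectory-level feedback can be biased.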

Code & Models

Repositories

TU2021/DTR (PyTorch, official)

Taxonomy

Topics: Reinforcement Learning in Robotics

Methods: Attention Is All You Need · Adam · Dropout · Position-Wise Feed-Forward Layer · Softmax · Dense Connections · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Label Smoothing