In-Dataset Trajectory Return Regularization for Offline Preference-based Reinforcement Learning
Songjun Tu, Jingbo Sun, Qichao Zhang, Yaocheng Zhang, Jia Liu, Ke Chen, Dongbin Zhao

TL;DR
This paper introduces In-Dataset Trajectory Return Regularization (DTR), a method that improves offline preference-based reinforcement learning by reducing reward bias and enhancing policy learning through sequence modeling and ensemble normalization.
Contribution
The paper proposes DTR, a novel regularization technique using sequence modeling and ensemble normalization to address reward bias in offline PbRL, improving policy performance.
Findings
DTR outperforms state-of-the-art baselines on multiple benchmarks.
Ensemble normalization effectively balances reward differentiation and accuracy.
Sequence modeling mitigates reward bias and improves trajectory stitching.
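The summary does not give the formula behind ensemble normalization, so the following is only a minimal sketch of one plausible reading: each reward model in an ensemble has its predictions rescaled to a common range before the members are averaged. The function name `ensemble_normalize`, the min-max scaling, and the `eps` parameter are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def ensemble_normalize(preds, eps=1e-8):
    """Hypothetical ensemble normalization (an assumption, not the paper's formula).

    preds: array of shape (n_models, n_steps) holding each ensemble
    member's raw reward predictions for the same transitions.
    Each member is min-max scaled to roughly [0, 1], then the members
    are averaged per step.
    """
    preds = np.asarray(preds, dtype=np.float64)
    lo = preds.min(axis=1, keepdims=True)   # per-member minimum
    hi = preds.max(axis=1, keepdims=True)   # per-member maximum
    scaled = (preds - lo) / (hi - lo + eps)  # eps guards against a constant member
    return scaled.mean(axis=0)

# Two members that disagree in scale but partly agree in ordering.
raw = np.array([[0.0, 1.0, 2.0],
                [10.0, 30.0, 20.0]])
rewards = ensemble_normalize(raw)
```

Scaling before averaging keeps a member with a large output range from dominating the mean, which is one way to preserve differences between rewards while still pooling the members' estimates.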
Abstract
Offline preference-based reinforcement learning (PbRL) typically operates in two phases: first, use human preferences to learn a reward model and annotate rewards for a reward-free offline dataset; second, learn a policy by optimizing the learned reward via offline RL. However, accurately modeling step-wise rewards from trajectory-level preference feedback presents inherent challenges. The reward bias introduced, particularly the overestimation of predicted rewards, leads to optimistic trajectory stitching, which undermines the pessimism mechanism critical to the offline RL phase. To address this challenge, we propose In-Dataset Trajectory Return Regularization (DTR) for offline PbRL, which leverages conditional sequence modeling to mitigate the risk of learning inaccurate trajectory stitching under reward bias. Specifically, DTR employs Decision Transformer and TD-Learning to strike a…
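The first phase described above, learning step-wise rewards from trajectory-level preferences, is commonly implemented with a Bradley-Terry model over summed segment rewards. The abstract does not state which preference model DTR uses, so the sketch below is an assumption based on common PbRL practice. The helper names `preference_loss` and `returns_to_go` are hypothetical. The second helper computes the return-to-go signal that Decision Transformer conditions on; how DTR combines it with TD-learning is not specified in the truncated text.

```python
import numpy as np

def preference_loss(r0, r1, label):
    """Bradley-Terry cross-entropy for one pair of segments (assumed model).

    r0, r1: 1-D arrays of predicted per-step rewards for segments 0 and 1.
    label: 1.0 if segment 1 is preferred, 0.0 if segment 0 is preferred.
    """
    s0, s1 = np.sum(r0), np.sum(r1)
    # log sigmoid(x) = -logaddexp(0, -x), which is numerically stable.
    log_p1 = -np.logaddexp(0.0, s0 - s1)  # log P(segment 1 preferred)
    log_p0 = -np.logaddexp(0.0, s1 - s0)  # log P(segment 0 preferred)
    return -(label * log_p1 + (1.0 - label) * log_p0)

def returns_to_go(rewards):
    """Undiscounted return-to-go: entry t is the sum of rewards from t onward."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return np.cumsum(rewards[::-1])[::-1]

# Equal segment returns give a 50/50 prediction, so the loss is log(2).
tie_loss = preference_loss(np.array([1.0, 1.0]), np.array([2.0, 0.0]), 1.0)
rtg = returns_to_go([1.0, 2.0, 3.0])
```

Because the loss only constrains the difference between summed segment rewards, many step-wise reward assignments fit the same preferences equally well, which is one source of the reward bias the abstract describes.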
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: Attention Is All You Need · Adam · Dropout · Position-Wise Feed-Forward Layer · Softmax · Dense Connections · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Label Smoothing
