Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward
Tengyu Xu, Yue Wang, Shaofeng Zou, Yingbin Liang

TL;DR
This paper introduces PARTED, an offline RL algorithm that effectively utilizes trajectory-wise rewards by reward decomposition and pessimistic value iteration, achieving provable efficiency in general MDPs.
Contribution
It proposes a novel reward redistribution method and a pessimistic value iteration framework for offline RL with trajectory-wise rewards, providing theoretical guarantees.
Findings
Achieves suboptimality of D_{\text{eff}}H^2/\sqrt{N} with neural networks.
Achieves suboptimality of dH^3/\sqrt{N} in linear MDPs.
First provably efficient offline RL algorithm for general MDPs with trajectory-wise rewards.
Abstract
The remarkable success of reinforcement learning (RL) heavily relies on observing the reward of every visited state-action pair. In many real-world applications, however, an agent can observe only a score that represents the quality of the whole trajectory, which is referred to as the trajectory-wise reward. In such a situation, it is difficult for standard RL methods to make good use of the trajectory-wise reward, and large bias and variance errors can be incurred in policy evaluation. In this work, we propose a novel offline RL algorithm, called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED), which decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward. To ensure the value functions constructed by PARTED are always pessimistic with respect…
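The least-squares reward redistribution step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a linear proxy reward shared across steps, r(s, a) = <phi(s, a), theta>, whereas the paper also covers neural network function approximation. The names `redistribute_rewards`, `proxy_reward`, and `solve_linear_system`, the ridge parameter, and the synthetic data are all hypothetical and exist only for illustration.

```python
import random

def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def redistribute_rewards(trajectories, returns, ridge=1e-6):
    """Fit theta so that sum_h <phi_h, theta> matches each trajectory return.

    trajectories: list of trajectories, each a list of per-step feature vectors.
    returns: one observed trajectory-wise reward per trajectory.
    Assumption: a single theta shared across steps (a simplification).
    """
    d = len(trajectories[0][0])
    # Summing features over a trajectory reduces the problem to ridge regression.
    X = [[sum(step[j] for step in traj) for j in range(d)] for traj in trajectories]
    A = [[sum(x[i] * x[j] for x in X) + (ridge if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(x[i] * r for x, r in zip(X, returns)) for i in range(d)]
    return solve_linear_system(A, b)

def proxy_reward(theta, phi):
    """Per-step proxy reward under the fitted linear model."""
    return sum(t * p for t, p in zip(theta, phi))

# Synthetic, noise-free data: 200 trajectories of length 5 with 3 features.
rng = random.Random(0)
true_theta = [0.5, -0.2, 0.3]
trajectories = [[[rng.random() for _ in range(3)] for _ in range(5)]
                for _ in range(200)]
returns = [sum(proxy_reward(true_theta, phi) for phi in traj)
           for traj in trajectories]
theta_hat = redistribute_rewards(trajectories, returns)
```

In this sketch only the trajectory-level `returns` are used for fitting, yet `proxy_reward(theta_hat, phi)` yields a per-step signal that a downstream value-iteration routine could consume. The pessimism penalty that PARTED adds on top is omitted here.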
Taxonomy
Topics: Reinforcement Learning in Robotics · Autonomous Vehicle Technology and Safety · Adversarial Robustness in Machine Learning
