Harnessing Mixed Offline Reinforcement Learning Datasets via Trajectory Weighting
Zhang-Wei Hong, Pulkit Agrawal, Rémi Tachet des Combes, Romain Laroche

TL;DR
This paper introduces a trajectory re-weighting method for offline reinforcement learning that enhances policy performance by better exploiting high-return trajectories in mixed datasets, applicable across various algorithms and environments.
Contribution
The paper proposes a re-weighted sampling strategy that improves offline RL performance by emphasizing high-return trajectories and that is compatible with existing algorithms.
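The idea of return-based re-weighting can be sketched as follows. This is a minimal illustration, not the authors' implementation: the softmax over min-max normalized returns, the `temperature` parameter, and the function names `trajectory_weights` and `sample_transitions` are all assumptions made for the example. Each trajectory receives a sampling weight that grows with its return, and a training batch is drawn by first picking a trajectory by weight and then picking a transition uniformly within it.

```python
import numpy as np


def trajectory_weights(returns, temperature=1.0):
    """Softmax weights over min-max normalized returns (an illustrative choice)."""
    returns = np.asarray(returns, dtype=np.float64)
    span = returns.max() - returns.min()
    if span == 0.0:
        # All trajectories have the same return: fall back to uniform sampling.
        return np.full(len(returns), 1.0 / len(returns))
    normalized = (returns - returns.min()) / span
    logits = normalized / temperature
    logits -= logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def sample_transitions(traj_lengths, weights, batch_size, rng):
    """Pick a trajectory by weight, then a transition uniformly within it."""
    traj_lengths = np.asarray(traj_lengths)
    traj_idx = rng.choice(len(weights), size=batch_size, p=weights)
    step_idx = rng.integers(0, traj_lengths[traj_idx])
    return traj_idx, step_idx


# Hypothetical mixed dataset: two low-return trajectories and one high-return one.
rng = np.random.default_rng(0)
returns = [1.0, 2.0, 50.0]
lengths = [10, 10, 10]
w = trajectory_weights(returns, temperature=0.1)
traj_idx, step_idx = sample_transitions(lengths, w, 256, rng)
```

Because only the sampling distribution changes, the sampled indices could in principle feed the replay buffer of any existing offline RL algorithm without modifying its loss. A lower `temperature` concentrates sampling on the best trajectories, while a very large one approaches uniform sampling over trajectories.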
Findings
Re-weighted sampling improves policy performance in mixed datasets.
The approach enhances exploitation of high-return trajectories.
The method remains effective in stochastic environments, even though its theoretical guarantees cover only deterministic MDPs.
Abstract
Most offline reinforcement learning (RL) algorithms return a target policy maximizing a trade-off between (1) the expected performance gain over the behavior policy that collected the dataset, and (2) the risk stemming from the out-of-distribution-ness of the induced state-action occupancy. It follows that the performance of the target policy is strongly related to the performance of the behavior policy and, thus, the trajectory return distribution of the dataset. We show that in mixed datasets consisting of mostly low-return trajectories and minor high-return trajectories, state-of-the-art offline RL algorithms are overly restrained by low-return trajectories and fail to exploit high-performing trajectories to the fullest. To overcome this issue, we show that, in deterministic MDPs with stochastic initial states, the dataset sampling can be re-weighted to induce an artificial dataset…
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: Implicit Q-Learning
