Sample-Efficient Policy Constraint Offline Deep Reinforcement Learning based on Sample Filtering
Yuanhao Chen, Qi Liu, Pengbin Chen, Zhongjian Qiao, Yanjie Li

TL;DR
This paper introduces a sample filtering technique for offline deep reinforcement learning that enhances sample efficiency and performance by selecting high-quality transitions based on episode rewards, addressing the distribution shift problem.
Contribution
It proposes a simple, effective sample filtering method that improves the training process of offline RL algorithms by focusing on high-reward transitions, leading to better performance.
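As a rough illustration of the idea, the sketch below keeps only transitions that belong to the highest-return episodes of a static dataset. The summary does not state the paper's exact filtering criterion, so the top-fraction rule, the function name `filter_by_episode_return`, and the `keep_fraction` parameter are all assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def filter_by_episode_return(rewards, terminals, keep_fraction=0.5):
    """Hypothetical helper: return indices of transitions that belong to the
    top `keep_fraction` of episodes ranked by undiscounted episode return.
    The ranking rule is an assumption, not the paper's stated criterion."""
    rewards = np.asarray(rewards, dtype=float)
    terminals = np.asarray(terminals, dtype=bool)
    if rewards.size == 0:
        return np.array([], dtype=int), np.array([], dtype=float)

    # An episode ends at every terminal flag; a trailing unfinished
    # episode is closed at the last transition of the dataset.
    ends = np.flatnonzero(terminals)
    if ends.size == 0 or ends[-1] != rewards.size - 1:
        ends = np.append(ends, rewards.size - 1)
    starts = np.concatenate(([0], ends[:-1] + 1))

    # Undiscounted return of each episode.
    returns = np.array([rewards[s:e + 1].sum() for s, e in zip(starts, ends)])

    # Keep at least one episode, chosen by highest return.
    n_keep = max(1, int(np.ceil(keep_fraction * returns.size)))
    top = np.argsort(-returns, kind="stable")[:n_keep]

    # Collect transition indices of the kept episodes in dataset order.
    keep = np.concatenate([np.arange(starts[i], ends[i] + 1) for i in sorted(top)])
    return keep, returns

# Three episodes with returns 3, 0, and 10; the top half (rounded up) is kept.
keep, returns = filter_by_episode_return(
    rewards=[1, 1, 1, 0, 0, 5, 5],
    terminals=[0, 0, 1, 0, 1, 0, 1],
    keep_fraction=0.5,
)
print(keep.tolist())     # [0, 1, 2, 5, 6]
print(returns.tolist())  # [3.0, 0.0, 10.0]
```

Under these assumptions, a policy constraint algorithm would then draw its minibatches from the `keep` indices only, so the behavior data that constrains the learned policy comes from the better episodes.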
Findings
Outperforms baseline methods on benchmark tasks.
Improves sample efficiency in offline RL.
Enhances final policy performance.
Abstract
Offline reinforcement learning (RL) aims to learn a policy that maximizes the expected return using a given static dataset of transitions. However, offline RL faces the distribution shift problem. Policy constraint offline RL methods have been proposed to address this problem. During policy constraint offline RL training, it is important to keep the difference between the learned policy and the behavior policy within a given threshold. Thus, the learned policy relies heavily on the quality of the behavior policy. However, existing policy constraint methods have a problem: if the dataset contains many low-reward transitions, the learned policy will be constrained by a suboptimal reference policy, leading to slow learning, low sample efficiency, and inferior performance. This paper shows that the sampling method in policy constraint offline RL that uses all the…
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Adaptive Dynamic Programming Control
