Data Efficient Training for Reinforcement Learning with Adaptive Behavior Policy Sharing
Ge Liu, Rui Wu, Heng-Tze Cheng, Jing Wang, Jayden Ooi, Lihong Li, Ang Li, Wai Lok Sibon Li, Craig Boutilier, Ed Chi

TL;DR
This paper introduces ABPS, a data-efficient reinforcement learning training method that shares experience among agents with adaptively selected policies, reducing hyper-parameter tuning costs and improving performance.
Contribution
The paper proposes ABPS, a novel adaptive experience sharing algorithm, and extends it with ABPS-PBT for hyper-parameter evolution, enhancing data efficiency and convergence speed in RL training.
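For readers unfamiliar with Population Based Training, the sketch below shows one generic exploit-and-explore step. It is not the paper's adapted variant, whose details are not given in this summary. The function name `pbt_step`, the dictionary keys, the bottom-fraction cutoff, and the perturbation factors are all assumptions made for illustration.

```python
import copy
import random


def pbt_step(population, rng, bottom_frac=0.25, perturb=(0.8, 1.2)):
    """One generic PBT step (illustrative; not the paper's adapted version).

    population: list of dicts with keys 'score', 'hparams', 'weights'.
    The lowest-scoring members copy a top member, then perturb its
    hyper-parameters. Members are modified in place.
    """
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    n_cut = max(1, int(len(ranked) * bottom_frac))
    top, bottom = ranked[:n_cut], ranked[-n_cut:]
    for member in bottom:
        parent = rng.choice(top)
        # Exploit: inherit the weights of a better-performing agent.
        member["weights"] = copy.deepcopy(parent["weights"])
        # Explore: scale each inherited hyper-parameter by a random factor.
        member["hparams"] = {
            name: value * rng.choice(perturb)
            for name, value in parent["hparams"].items()
        }
    return population
```

Calling `pbt_step(population, random.Random(0))` after each evaluation round would let weak hyper-parameter settings be replaced during training rather than after it, which is the general idea behind evolving hyper-parameters in a single run.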
Findings
ABPS outperforms traditional hyper-parameter tuning in Atari games.
ABPS reduces variance among top agents.
ABPS-PBT accelerates convergence and further reduces variance.
Abstract
Deep Reinforcement Learning (RL) has proven powerful for decision making in simulated environments. However, training deep RL models is challenging in real-world applications such as production-scale health-care or recommender systems, because interaction is expensive and the deployment budget is limited. One source of this data inefficiency is the expensive hyper-parameter tuning required when optimizing deep neural networks. We propose Adaptive Behavior Policy Sharing (ABPS), a data-efficient training algorithm that allows sharing of experience collected by a behavior policy that is adaptively selected from a pool of agents trained with an ensemble of hyper-parameters. We further extend ABPS to evolve hyper-parameters during training by hybridizing ABPS with an adapted version of Population Based Training (ABPS-PBT). We conduct experiments with multiple Atari games with up to 16…
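The experience-sharing idea in the abstract can be pictured with a minimal sketch: one agent from the pool is chosen to act, and its transitions go into a buffer that every agent trains from. The abstract does not state how the behavior policy is selected, so the epsilon-greedy rule over recent evaluation returns used here is an assumption, as are the names `SharedReplayBuffer` and `select_behavior_agent`.

```python
import random
from collections import deque


class SharedReplayBuffer:
    """A single buffer that every agent in the pool samples from (hypothetical)."""

    def __init__(self, capacity):
        # Oldest transitions are dropped once capacity is reached.
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size, rng):
        # Return at most batch_size transitions, drawn without replacement.
        k = min(batch_size, len(self.storage))
        return rng.sample(list(self.storage), k)


def select_behavior_agent(recent_returns, epsilon, rng):
    """Pick the index of the agent that collects the next batch of experience.

    Assumed rule (not confirmed by the source): with probability epsilon
    pick a random agent, otherwise pick the one with the best recent return.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(recent_returns))
    return max(range(len(recent_returns)), key=lambda i: recent_returns[i])
```

In a training loop under these assumptions, each round would call `select_behavior_agent`, let that agent interact with the environment, `add` its transitions to the shared buffer, and then update every agent from `sample`. Only one agent pays the interaction cost per round while all hyper-parameter settings keep learning.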
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Data Stream Mining Techniques
Methods: SPEED (Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings) · Population Based Training
