AMBER: Adaptive Multi-Batch Experience Replay for Continuous Action Control
Seungyul Han, Youngchul Sung

TL;DR
This paper introduces AMBER, an adaptive multi-batch experience replay scheme for PPO that enhances convergence speed and stability in continuous control tasks by adaptively utilizing past policy data based on importance sampling weights.
Contribution
The paper presents a novel adaptive multi-batch experience replay method integrated with PPO, improving learning efficiency and stability in continuous action control.
Findings
Significantly faster convergence in continuous control tasks.
Enhanced stability of PPO with the new replay scheme.
Maintains low bias through importance sampling and advantage storage.
Abstract
In this paper, a new adaptive multi-batch experience replay scheme is proposed for proximal policy optimization (PPO) for continuous action control. In contrast to the original PPO, the proposed scheme uses the batch samples of past policies, as well as those of the current policy, to update the next policy. The number of past batches used is determined adaptively from the oldness of the past batches, measured by the average importance sampling (IS) weight. The new algorithm, constructed by combining PPO with the proposed multi-batch experience replay scheme, maintains the advantages of the original PPO, such as random mini-batch sampling and small bias due to low IS weights, by storing the pre-computed advantages and values and adaptively determining the mini-batch size. Numerical results show that the proposed method significantly increases the speed and stability of convergence on…
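The adaptive batch selection described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the oldness measure (the batch mean of max(w, 1/w), where w is the IS weight), the threshold value 1.2, the newest-first ordering, the stop-at-first-old-batch rule, and the function names `average_is_deviation` and `select_batches` are all assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence


def average_is_deviation(new_logp: Sequence[float], old_logp: Sequence[float]) -> float:
    """Mean of max(w, 1/w) over one batch, with w = pi_new(a|s) / pi_old(a|s).

    Assumed oldness measure: 1.0 means the batch looks on-policy,
    larger values mean the batch is further from the current policy.
    """
    total = 0.0
    for lp_new, lp_old in zip(new_logp, old_logp):
        # max(w, 1/w) equals exp(|log w|), computed from log-probabilities.
        total += math.exp(abs(lp_new - lp_old))
    return total / len(new_logp)


def select_batches(
    new_logps: Sequence[Sequence[float]],
    old_logps: Sequence[Sequence[float]],
    threshold: float = 1.2,          # hypothetical value, not from the paper
    max_batches: Optional[int] = None,
) -> List[int]:
    """Return indices of stored batches to reuse, ordered newest first.

    Index 0 is the batch from the current policy and is always kept.
    Older batches are added until one exceeds the oldness threshold
    (assumed stopping rule) or max_batches is reached.
    """
    selected: List[int] = []
    for i, (lp_new, lp_old) in enumerate(zip(new_logps, old_logps)):
        if max_batches is not None and len(selected) >= max_batches:
            break
        if i > 0 and average_is_deviation(lp_new, lp_old) > threshold:
            break
        selected.append(i)
    return selected


# Log-probabilities of the stored actions under the current policy ...
current = [[-1.0, -1.0], [-1.0, -1.0], [-1.0, -1.0]]
# ... and under the policy that originally generated each batch.
behavior = [[-1.0, -1.0], [-1.1, -0.9], [-2.0, -0.2]]
chosen = select_batches(current, behavior)
```

In this toy input the second batch deviates only slightly from the current policy and is reused, while the third is too old and ends the selection. The samples from the selected batches, together with their stored advantages and values, would then be pooled for random mini-batch sampling as the abstract describes.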
Taxonomy
Topics: Reinforcement Learning in Robotics · Smart Grid Energy Management · Advanced Bandit Algorithms Research
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Entropy Regularization · Proximal Policy Optimization · Experience Replay
