AMBER: Adaptive Multi-Batch Experience Replay for Continuous Action Control

Seungyul Han; Youngchul Sung

arXiv:1710.04423 · cs.LG · October 3, 2018 · 5 cites

TL;DR

This paper introduces AMBER, an adaptive multi-batch experience replay scheme for PPO that enhances convergence speed and stability in continuous control tasks by adaptively utilizing past policy data based on importance sampling weights.

Contribution

The paper presents a novel adaptive multi-batch experience replay method integrated with PPO, improving learning efficiency and stability in continuous action control.

Findings

1. Significantly faster convergence in continuous control tasks.

2. Enhanced stability of PPO with the new replay scheme.

3. Maintains low bias through importance sampling and advantage storage.

Abstract

This paper proposes a new adaptive multi-batch experience replay scheme for proximal policy optimization (PPO) in continuous action control. In contrast to the original PPO, the proposed scheme uses batch samples from past policies as well as from the current policy to update the next policy, where the number of past batches used is determined adaptively from the oldness of those batches, measured by the average importance sampling (IS) weight. The new algorithm, constructed by combining PPO with the proposed multi-batch experience replay scheme, maintains the advantages of the original PPO, such as random mini-batch sampling and small bias due to low IS weights, by storing the pre-computed advantages and values and adaptively determining the mini-batch size. Numerical results show that the proposed method significantly increases the speed and stability of convergence on…
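The adaptive batch selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The oldness measure used here (the batch mean of 1 + |1 - ratio|), the threshold name `eps_b`, the cap `max_batches`, and both function names are assumptions introduced for illustration; the abstract states only that oldness is measured by the average IS weight.

```python
import numpy as np


def average_is_weight(logp_new, logp_old):
    """Oldness of one stored batch (assumed form: mean of 1 + |1 - ratio|)."""
    # IS ratio pi_current(a|s) / pi_behavior(a|s), computed from log-probabilities.
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    return float(np.mean(1.0 + np.abs(1.0 - ratio)))


def select_replay_batches(logp_new_per_batch, logp_old_per_batch,
                          eps_b=0.25, max_batches=8):
    """Return indices of the batches to use for the next update.

    Batches are ordered newest first; index 0 is the current policy's batch
    and is always used. Older batches are added until one is judged too old.
    eps_b and max_batches are hypothetical hyperparameters.
    """
    selected = [0]
    limit = min(len(logp_old_per_batch), max_batches)
    for i in range(1, limit):
        oldness = average_is_weight(logp_new_per_batch[i], logp_old_per_batch[i])
        if oldness >= 1.0 + eps_b:
            break  # assume every batch older than this one is also too old
        selected.append(i)
    return selected


# Toy data: four stored batches of four samples each, newest first.
logp_old = [np.zeros(4) for _ in range(4)]
logp_new = [np.zeros(4), np.full(4, 0.1), np.full(4, 0.2), np.full(4, 0.5)]
used = select_replay_batches(logp_new, logp_old)
```

In this toy run the three newest batches stay under the threshold and the fourth is dropped. Stopping at the first batch that is too old is a design choice of the sketch; it assumes the policy drifts steadily away from older data.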

Taxonomy

Topics: Reinforcement Learning in Robotics · Smart Grid Energy Management · Advanced Bandit Algorithms Research

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Entropy Regularization · Proximal Policy Optimization · Experience Replay