Off-Policy Correction for Deep Deterministic Policy Gradient Algorithms via Batch Prioritized Experience Replay

Dogan C. Cicek; Enes Duran; Baturay Saglam; Furkan B. Mutlu; Suleyman S. Kozat

arXiv:2111.01865·cs.LG·November 15, 2021

Open Access

TL;DR

This paper introduces KLPER, a novel batch prioritization method for experience replay in deep deterministic policy gradient algorithms, improving sample efficiency and stability by focusing on recent policy-aligned transitions.

Contribution

The paper proposes a new batch-based prioritization algorithm, KLPER, that enhances off-policy correction and efficiency in deep deterministic policy gradient methods.
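One way to picture batch-level prioritization is sketched below. It is an illustration under stated assumptions, not the paper's procedure, which this page does not spell out. The sketch draws several candidate batches uniformly and keeps the one whose stored actions look most like what the current deterministic policy would output. It assumes both the behaviour policy and the current policy are modelled as Gaussians with the same fixed isotropic variance, in which case the KL divergence reduces to a scaled squared distance between means. The names `gaussian_kl_same_var`, `select_batch`, `num_batches`, and `sigma` are hypothetical.

```python
import random


def gaussian_kl_same_var(mu_p, mu_q, sigma=0.1):
    """KL(N(mu_p, sigma^2 I) || N(mu_q, sigma^2 I)) = ||mu_p - mu_q||^2 / (2 sigma^2)."""
    return sum((p - q) ** 2 for p, q in zip(mu_p, mu_q)) / (2.0 * sigma ** 2)


def select_batch(buffer, policy, num_batches=4, batch_size=2, sigma=0.1, rng=None):
    """Draw candidate batches uniformly and return the one with the lowest mean KL score.

    buffer: list of (state, action) pairs, each a tuple of floats (assumed layout).
    policy: callable mapping a state to the current policy's action (assumed interface).
    """
    rng = rng or random.Random(0)
    best_batch, best_score = None, float("inf")
    for _ in range(num_batches):
        batch = rng.sample(buffer, batch_size)
        # Average divergence between stored actions and current-policy actions.
        score = sum(
            gaussian_kl_same_var(action, policy(state), sigma)
            for state, action in batch
        ) / batch_size
        if score < best_score:
            best_batch, best_score = batch, score
    return best_batch, best_score
```

A lower score means the batch's actions are closer to the current policy's actions, so training on it is closer to an on-policy update under these assumptions.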

Findings

01. KLPER improves sample efficiency in continuous control tasks.

02. KLPER enhances policy stability during training.

03. KLPER achieves better final performance compared to baseline algorithms.

Abstract

The experience replay mechanism allows agents to reuse their experiences multiple times. In prior work, the sampling probability of each transition was adjusted according to its importance. Reassigning sampling probabilities for every transition in the replay buffer after each iteration is highly inefficient. Therefore, to gain computational efficiency, experience replay prioritization algorithms recalculate the significance of a transition only when that transition is sampled. However, the importance of the transitions changes dynamically as the agent's policy and value function are updated. In addition, the replay buffer stores transitions generated by the agent's previous policies, which may deviate significantly from its most recent policy. Higher deviation from the most recent policy leads to more off-policy updates, which…
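The lazy priority refresh described in the abstract can be sketched as a toy proportional prioritized replay buffer. This is a minimal illustration of the general idea from prior prioritization schemes, not the paper's KLPER method. The class name `LazyPrioritizedReplay`, the use of the absolute TD error as priority, and the `eps` offset are assumptions made for the example.

```python
import random


class LazyPrioritizedReplay:
    """Toy proportional prioritized replay; a priority changes only when its item is sampled."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.transitions = []
        self.priorities = []
        self.next_slot = 0
        self.rng = random.Random(seed)

    def add(self, transition):
        # New transitions receive the current maximum priority so they are likely to be replayed.
        priority = max(self.priorities, default=1.0)
        if len(self.transitions) < self.capacity:
            self.transitions.append(transition)
            self.priorities.append(priority)
        else:
            # Buffer full: overwrite the oldest slot (ring buffer).
            self.transitions[self.next_slot] = transition
            self.priorities[self.next_slot] = priority
        self.next_slot = (self.next_slot + 1) % self.capacity

    def sample(self, batch_size):
        # Sample indices with probability proportional to stored priority (with replacement).
        indices = self.rng.choices(
            range(len(self.transitions)), weights=self.priorities, k=batch_size
        )
        return indices, [self.transitions[i] for i in indices]

    def update_priorities(self, indices, td_errors, eps=1e-3):
        # Only the sampled transitions are refreshed; every other priority stays stale.
        for i, err in zip(indices, td_errors):
            self.priorities[i] = abs(err) + eps
```

Because `update_priorities` touches only the sampled indices, each step costs time proportional to the batch size, but the untouched priorities may no longer reflect the current policy and value function, which is the staleness the abstract points to.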

Taxonomy

Topics: Age of Information Optimization · Mind wandering and attention · Advanced Bandit Algorithms Research

Methods: Experience Replay