Off-Policy Correction for Deep Deterministic Policy Gradient Algorithms via Batch Prioritized Experience Replay
Dogan C. Cicek, Enes Duran, Baturay Saglam, Furkan B. Mutlu, Suleyman S. Kozat

TL;DR
This paper introduces KLPER, a novel batch prioritization method for experience replay in deep deterministic policy gradient algorithms, improving sample efficiency and stability by focusing on transitions aligned with the agent's most recent policy.
Contribution
The paper proposes a new batch-based prioritization algorithm, KLPER, that enhances off-policy correction and efficiency in deep deterministic policy gradient methods.
Findings
KLPER improves sample efficiency in continuous control tasks.
KLPER enhances policy stability during training.
KLPER achieves better final performance compared to baseline algorithms.
Abstract
The experience replay mechanism allows agents to reuse their experiences multiple times. In prior work, the sampling probability of each transition was adjusted according to its importance. Reassigning sampling probabilities for every transition in the replay buffer after each iteration is highly inefficient. Therefore, to remain computationally efficient, experience replay prioritization algorithms recalculate the significance of a transition only when that transition is sampled. However, the importance of transitions changes dynamically as the agent's policy and value function are updated. In addition, the experience replay buffer stores transitions generated by previous policies of the agent, which may deviate significantly from its most recent policy. Greater deviation from the most recent policy leads to more off-policy updates, which…
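The lazy priority update described in the abstract can be illustrated with a minimal sketch. This is a generic proportional prioritized replay buffer, not KLPER itself, whose details are not given in the text above. The class name LazyPrioritizedReplay, the parameters alpha and eps, and the use of absolute TD error as the importance measure are all assumptions made for illustration.

```python
import random


class LazyPrioritizedReplay:
    """Toy proportional prioritized replay (hypothetical, for illustration).

    Priorities are refreshed only for transitions that were just sampled,
    so the priorities of all other stored transitions can become stale.
    """

    def __init__(self, capacity, alpha=0.6, eps=1e-3, seed=0):
        self.capacity = capacity
        self.alpha = alpha          # assumed exponent shaping the sampling distribution
        self.eps = eps              # keeps every priority strictly positive
        self.transitions = []
        self.priorities = []
        self.next_slot = 0          # ring-buffer write position
        self.rng = random.Random(seed)

    def add(self, transition):
        # New transitions receive the current maximum priority,
        # which makes them likely to be sampled soon.
        p = max(self.priorities, default=1.0)
        if len(self.transitions) < self.capacity:
            self.transitions.append(transition)
            self.priorities.append(p)
        else:
            # Buffer is full: overwrite the oldest slot.
            self.transitions[self.next_slot] = transition
            self.priorities[self.next_slot] = p
        self.next_slot = (self.next_slot + 1) % self.capacity

    def sample(self, batch_size):
        # Sampling probability is proportional to priority ** alpha.
        weights = [p ** self.alpha for p in self.priorities]
        indices = self.rng.choices(
            range(len(self.transitions)), weights=weights, k=batch_size
        )
        return indices, [self.transitions[i] for i in indices]

    def update(self, indices, td_errors):
        # Only the sampled transitions get a fresh priority.
        for i, err in zip(indices, td_errors):
            self.priorities[i] = abs(err) + self.eps


# Usage with placeholder transitions (state, action, reward, next_state).
buf = LazyPrioritizedReplay(capacity=4)
for t in range(6):
    buf.add(("s%d" % t, "a", 0.0, "s%d" % (t + 1)))
idx, batch = buf.sample(2)
buf.update(idx, [0.5, 2.0])  # made-up TD errors for the sampled batch
```

In this sketch, every transition that was not sampled keeps the priority it was assigned earlier, even though the policy and value function may have changed since. That staleness is the issue the abstract raises about per-transition prioritization.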
Taxonomy
Topics: Age of Information Optimization · Mind wandering and attention · Advanced Bandit Algorithms Research
Methods: Experience Replay
