# Dual experience replay enhanced deep deterministic policy gradient for efficient continuous data sampling

**Authors:** Teh Noranis Mohd Aris, Ningning Chen, Norwati Mustapha, Maslina Zolkepli

PMC · DOI: 10.1371/journal.pone.0334411 · PLOS One · 2025-11-11

## TL;DR

This paper introduces TPDEB, a reinforcement learning framework for asynchronous distributed training that improves sample efficiency and policy stability by pairing a standard replay buffer with a trajectory-level prioritized replay buffer and adding KL-regularized learning.

## Contribution

The novel dual experience replay framework TPDEB introduces trajectory-level prioritized replay and KL-regularized learning to enhance robustness in asynchronous distributed reinforcement learning.
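
To make the KL-regularized term concrete, here is a minimal sketch of a KL penalty added to a task loss. This summary does not give the paper's exact formulation, so everything here is an assumption: each actor's policy is treated as a diagonal Gaussian around its action, and the names `diag_gaussian_kl`, `regularized_loss`, and `beta` are hypothetical.

```python
import math


def diag_gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """KL(p || q) for two diagonal Gaussians, given per-dimension means and std devs.

    Assumption: policies are modeled as diagonal Gaussians; the summary does not
    state which policy distribution the paper regularizes.
    """
    kl = 0.0
    for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q):
        # Closed-form KL between two univariate Gaussians, summed over dimensions.
        kl += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return kl


def regularized_loss(task_loss, kl, beta=0.1):
    """Task loss plus a KL penalty; beta (hypothetical) sets how strongly drift is punished."""
    return task_loss + beta * kl


# A stale actor policy (p) compared against the learner's current policy (q).
kl = diag_gaussian_kl([1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
loss = regularized_loss(task_loss=1.0, kl=kl, beta=0.1)
```

A larger `beta` keeps actor policies closer to the learner's policy at the cost of slower policy change; the value used in the paper is not stated in this summary.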

## Key findings

- TPDEB outperforms baseline algorithms in convergence speed and final performance on MuJoCo benchmarks.
- Trajectory-level prioritization captures higher-quality samples than step-wise methods.
- KL-regularization improves stability during asynchronous updates.
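
The difference between trajectory-level and step-wise prioritization can be sketched as follows. This summary does not say how TPDEB scores a trajectory, so the scoring rule here (mean absolute TD error, raised to an exponent `alpha` as in step-wise prioritized replay) is an assumption, and the function names are hypothetical.

```python
def trajectory_priority(td_errors, eps=1e-6):
    """Score a whole trajectory by its mean absolute TD error.

    Assumption: mean |TD error| is used only for illustration; step-wise methods
    would instead assign one priority per transition.
    """
    if not td_errors:
        raise ValueError("trajectory has no steps")
    return sum(abs(e) for e in td_errors) / len(td_errors) + eps


def sampling_probs(priorities, alpha=0.6):
    """Turn priorities into sampling probabilities; alpha=0 gives uniform sampling."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


# Two trajectories: one with small TD errors, one with large ones.
priorities = [trajectory_priority([0.1, -0.2, 0.1]), trajectory_priority([2.0, -1.5, 3.0])]
probs = sampling_probs(priorities)
```

Because the priority is attached to the whole trajectory, the steps sampled from it stay temporally coherent, which is the property the findings above attribute to trajectory-level prioritization.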

## Abstract

To address the inefficiencies in sample utilization and policy instability in asynchronous distributed reinforcement learning, we propose TPDEB, a dual experience replay framework that integrates prioritized sampling and temporal diversity. While recent distributed RL systems have scaled well, they often suffer from instability and inefficient sampling under network-induced delays and stale policy updates, highlighting a gap in robust learning under asynchronous conditions. TPDEB significantly improves convergence speed and robustness by coordinating dual-buffer updates across distributed agents, offering a scalable solution to real-world continuous control tasks. TPDEB addresses these limitations through two key mechanisms: a trajectory-level prioritized replay buffer that captures temporally coherent high-value experiences, and KL-regularized learning that constrains policy drift across actors. Unlike prior approaches relying on a single experience buffer, TPDEB employs a dual-buffer strategy that combines standard and prioritized replay buffers. This enables better trade-offs between unbiased sampling and value-driven prioritization, improving learning robustness under asynchronous actor updates. Moreover, TPDEB collects more diverse and redundant experience by scaling parallel actor replicas. Empirical evaluations on MuJoCo continuous control benchmarks demonstrate that TPDEB outperforms baseline distributed algorithms in both convergence speed and final performance, especially under constrained actor–learner bandwidth. Ablation studies validate the contribution of each component, showing that trajectory-level prioritization captures high-quality samples more effectively than step-wise methods, and KL-regularization enhances stability across asynchronous updates. These findings support TPDEB as a practical and scalable solution for distributed reinforcement learning systems.
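
The dual-buffer idea in the abstract, mixing unbiased sampling with value-driven prioritization, can be illustrated with a small sketch. This is not the authors' implementation: the class `DualReplay`, the `mix` ratio, and the way a batch is split between the two buffers are all assumptions made for illustration, and priorities are assumed to be positive.

```python
import random
from collections import deque


class DualReplay:
    """Hypothetical dual buffer: a uniform transition buffer plus a prioritized trajectory buffer."""

    def __init__(self, capacity=10_000, seed=0):
        self.uniform = deque(maxlen=capacity)       # standard buffer of single transitions
        self.trajectories = deque(maxlen=capacity)  # (priority, transitions) pairs
        self.rng = random.Random(seed)

    def add_trajectory(self, transitions, priority):
        transitions = list(transitions)
        if not transitions:
            return  # ignore empty trajectories
        self.uniform.extend(transitions)
        self.trajectories.append((float(priority), transitions))

    def sample(self, batch_size, mix=0.5):
        """Draw round(batch_size * mix) transitions by trajectory priority, the rest uniformly."""
        if not self.uniform:
            raise ValueError("buffer is empty")
        n_prio = round(batch_size * mix)
        n_unif = batch_size - n_prio
        batch = []
        if n_prio:
            entries = list(self.trajectories)
            weights = [p for p, _ in entries]  # assumed strictly positive
            picked = self.rng.choices(entries, weights=weights, k=n_prio)
            # Take one transition from each prioritized trajectory that was picked.
            batch.extend(self.rng.choice(traj) for _, traj in picked)
        batch.extend(self.rng.choice(self.uniform) for _ in range(n_unif))
        return batch


buffer = DualReplay(seed=0)
buffer.add_trajectory(["a1", "a2", "a3"], priority=5.0)
buffer.add_trajectory(["b1", "b2"], priority=0.5)
batch = buffer.sample(batch_size=8, mix=0.25)  # 2 prioritized draws, 6 uniform draws
```

Setting `mix=0.0` recovers plain uniform replay and `mix=1.0` purely prioritized replay, which is one simple way to expose the trade-off the abstract describes; how TPDEB actually balances the two buffers is not specified in this summary.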

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12604787/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12604787/full.md

## References

38 references — full list in the complete paper: https://tomesphere.com/paper/PMC12604787/full.md

---
Source: https://tomesphere.com/paper/PMC12604787