Uncertainty Prioritized Experience Replay
Rodrigo Carrasco-Davis, Sebastian Lee, Claudia Clopath, Will Dabney

TL;DR
This paper introduces a prioritized experience replay method that uses epistemic uncertainty to select informative transitions, reducing the influence of noisy transitions and improving learning efficiency in deep reinforcement learning.
Contribution
It proposes using epistemic uncertainty for prioritizing experience replay, addressing noise issues in traditional TD-error based methods, and demonstrates improved performance on Atari benchmarks.
Findings
Epistemic uncertainty prioritization outperforms TD-error in noisy environments.
Method improves sample efficiency in deep reinforcement learning.
Achieves better results than quantile regression DQN on Atari.
Abstract
Prioritized experience replay, which improves sample efficiency by selecting relevant transitions to update parameter estimates, is a crucial component of contemporary value-based deep reinforcement learning models. Typically, transitions are prioritized based on their temporal difference error. However, this approach is prone to favoring noisy transitions, even when the value estimation closely approximates the target mean. This phenomenon resembles the noisy TV problem postulated in the exploration literature, in which exploration-guided agents get stuck by mistaking noise for novelty. To mitigate the disruptive effects of noise in value estimation, we propose using epistemic uncertainty estimation to guide the prioritization of transitions from the replay buffer. Epistemic uncertainty quantifies the uncertainty that can be reduced by learning, hence reducing transitions sampled from…
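The contrast described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the two-action setup, the ensemble values, and the use of variance across ensemble members as a stand-in for epistemic uncertainty are all assumptions made for illustration. One action has a well-learned value but noisy rewards; the other has a deterministic reward but has rarely been seen, so the hypothetical ensemble members disagree about it.

```python
import random
import statistics


def td_error_priority(q_estimate, sampled_target):
    """Priority used by TD-error based replay: absolute TD error."""
    return abs(sampled_target - q_estimate)


def epistemic_priority(ensemble_estimates):
    """Illustrative proxy (an assumption): variance across ensemble members."""
    return statistics.pvariance(ensemble_estimates)


rng = random.Random(0)

# Hypothetical ensemble value estimates for two actions.
noisy_ensemble = [0.01, -0.02, 0.00, 0.02, -0.01]  # well learned: members agree
novel_ensemble = [0.2, 0.9, -0.3, 0.6, 0.1]        # rarely seen: members disagree

# Noisy action: true mean 0, reward standard deviation 2.
noisy_targets = [rng.gauss(0.0, 2.0) for _ in range(1000)]
noisy_q = statistics.fmean(noisy_ensemble)
td_noisy = statistics.fmean(
    [td_error_priority(noisy_q, target) for target in noisy_targets]
)

# Novel action: deterministic reward of 1.0.
td_novel = td_error_priority(statistics.fmean(novel_ensemble), 1.0)

epi_noisy = epistemic_priority(noisy_ensemble)
epi_novel = epistemic_priority(novel_ensemble)

print(f"TD-error priority   noisy={td_noisy:.3f}  novel={td_novel:.3f}")
print(f"Epistemic priority  noisy={epi_noisy:.4f}  novel={epi_novel:.4f}")
```

Under these assumptions, the average TD error ranks the noisy action higher even though its value estimate already matches the target mean, while the ensemble-disagreement proxy ranks the under-learned action higher.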
Peer Reviews
Decision·Submitted to ICLR 2025
1) Motivating examples that clearly show the advantage over PER. 2) The paper is well-written and clearly presented.
1) The method is only compared to Prioritized Experience Replay (PER), which is quite old. There are other approaches that have improved upon PER in recent years, such as [1]. We expected to see comparisons with these newer methods. 2) The paper claims that it enhances sample efficiency; however, it does not demonstrate how their approach contributes to solving one of the most challenging Atari games, Montezuma’s Revenge, in the main paper. An experiment in the appendix reveals no improvements
1. The paper tackles the critical issue of exploration in RL and offers a promising method that uses uncertainty to prioritize interaction data. 2. Comprehensive experiments are provided, ranging from bandits and tabular tasks to Atari tasks.
1. The paper lacks contemporary baselines in Atari games. The baselines compared in this paper are somewhat outdated, while several recent works also consider estimating uncertainty to improve exploration. 2. The writing could be polished further. Some sentences are quite colloquial, such as "this is shown graphically in Figure 14 and Figure 15" in line 394. 3. The paper primarily discusses the advantages of their prioritized variables through verbal descriptions and toy examples. It would benef
1. Reasonable motivation: The authors introduce two examples: conal bandits and a noisy gridworld. 2. Easy to implement with good insights: The authors propose a new formula for uncertainty and a way to compute prioritization from that uncertainty using the concept of information gain. 3. Implementation does not seem difficult.
1. The authors used QR-DQN as the benchmark, so it is unclear whether this concept remains valid across other distributional Q-learning methods (e.g., C51, Rainbow). 2. The authors provided computational costs in Table 1 for each algorithm, but it is unclear how the computational cost compares among Random, PER, and UPER.
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Artificial Intelligence in Games
