Reliability-Adjusted Prioritized Experience Replay

Leonard S. Pleiss; Tobias Sutter; Maximilian Schiffer

arXiv:2506.18482·cs.LG·December 16, 2025

3 Reviews

TL;DR

This paper introduces ReaPER, an extension of Prioritized Experience Replay that uses a new measure of transition reliability to improve learning efficiency in reinforcement learning agents, demonstrated through theoretical and empirical results.

Contribution

The paper proposes a novel reliability measure for transitions in PER, enhancing sampling efficiency and learning performance in reinforcement learning.

Findings

1. ReaPER outperforms PER in various environments.
2. Theoretical analysis shows improved learning efficiency.
3. Empirical results on Atari-10 benchmark confirm effectiveness.

Abstract

Experience replay enables data-efficient learning from past experiences in online reinforcement learning agents. Traditionally, experiences were sampled uniformly from a replay buffer, regardless of differences in experience-specific learning potential. In an effort to sample more efficiently, researchers introduced Prioritized Experience Replay (PER). In this paper, we propose an extension to PER by introducing a novel measure of temporal difference error reliability. We theoretically show that the resulting transition selection algorithm, Reliability-adjusted Prioritized Experience Replay (ReaPER), enables more efficient learning than PER. We further present empirical results showing that ReaPER outperforms PER across various environment types, including the Atari-10 benchmark.
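For readers unfamiliar with the PER baseline that the abstract builds on, the following is a minimal sketch of standard proportional prioritization, in which transitions are sampled with probability proportional to a power of their absolute temporal difference error. It is not the ReaPER algorithm, since the reliability measure is not specified on this page. The function name `per_sample` and the default values of `alpha`, `beta`, and `eps` are illustrative assumptions.

```python
import numpy as np


def per_sample(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6, rng=None):
    """Sample buffer indices proportionally to |TD error|^alpha (standard PER).

    Illustrative sketch only: the name and default hyperparameters are
    assumptions, and no reliability adjustment (ReaPER) is applied here.
    """
    rng = np.random.default_rng() if rng is None else rng
    td_errors = np.asarray(td_errors, dtype=float)

    # Priority of each transition; eps keeps zero-error transitions sampleable.
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()

    # Draw a batch of indices (with replacement) according to the priorities.
    idx = rng.choice(len(probs), size=batch_size, p=probs)

    # Importance-sampling weights correct for the non-uniform sampling;
    # here they are normalized by the largest weight in the batch.
    weights = (len(probs) * probs[idx]) ** (-beta)
    weights = weights / weights.max()
    return idx, weights, probs


if __name__ == "__main__":
    idx, weights, probs = per_sample([0.1, 0.5, 2.0, 0.0], batch_size=8)
    print(idx, weights, probs)
```

Setting `alpha=0` recovers the uniform sampling that the abstract describes as the traditional approach, which is why the exponent is a convenient knob for comparing the two regimes.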

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- This is a good paper and it should be accepted. Its writing is clear, coherent, and well organized. The ideas are easy to grasp and follow, with good justification and explanation for most design choices. The authors use illustrative examples and theoretical justification to support their proposed reliability measure and its incorporation into PER.
- The paper does not overreach in its claims and its conclusions mostly match the provided evidence. The theoretical results are presented with eas

Weaknesses

- Most of the justification and intuition behind ReaPER and the proposed reliability score is presented in the tabular setting. Non-linear generalization can wildly change the Q-values during learning, especially when the data distribution is being modified. Including a discussion of how this reliability measure interacts with non-linear generalization (oversampling certain transitions, higher noise in predicted values, overgeneralization, etc.) would better support and justify the ReaPER algorit

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The paper is **well-written and clearly structured**.
- Theoretical foundation linking reliability scores to TDE target biases.
- Clear methodological improvements over standard PER.
- Algorithm is model-agnostic and straightforward to implement in existing off-policy methods.

Weaknesses

1. **Limited baseline comparisons.** The paper primarily compares ReaPER only against PER. While the authors justify this by PER’s ongoing adoption in SOTA systems, comparisons against additional baselines (e.g., Sample-Efficient Deep Reinforcement Learning via Episodic Backward Update) would make the empirical evidence more convincing.
2. ReaPER does **not robustly address episode-length variance**, limiting its generalizability. This introduces severe biases and undermines reliability scores, making them heavily dependent on arbitrar

Reviewer 03 · Rating 2 · Confidence 5

Strengths

1. The proposed concept is well motivated by theoretical analysis and easy to implement on top of the prioritized experience replay framework.
2. The empirical analysis on both continuous control and Atari-10 shows robust and superior performance against all the baselines.
3. The writing is logically fluent and easy to follow.

Weaknesses

1. The assumption appears overly strong. Specifically, Assumption 3.4 relies on a bias bound that presumes near-optimal trajectories. During training, policies are far from optimal, so this bound rarely holds early on. Moreover, function approximation and bootstrapping introduce high variance and correlation in TD errors, making the downstream sum an unreliable proxy for bias. In partially observed environments in particular, future TD errors may fluctuate unpredictably.
2. The empirical experime
