List Replicable Reinforcement Learning

Bohan Zhang; Michael Chen; A. Pavan; N. V. Vinodchandran; Lin F. Yang; Ruosong Wang

arXiv:2512.00553·cs.LG·December 2, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a formal framework for list replicability in reinforcement learning and proposes algorithms that, with high probability, return a near-optimal policy drawn from a small list of policies across runs, addressing the instability and sensitivity of RL training.

Contribution

The paper presents the first provably efficient RL algorithms guaranteeing list replicability with polynomial list complexity, including strong forms that constrain entire policy sequences.

Findings

01 Proposes a novel planning strategy based on lexicographic ordering.

02 Develops mechanisms for testing state reachability while maintaining replicability.

03 Shows empirically that incorporating the planning strategy improves RL stability.

Abstract

Replicability is a fundamental challenge in reinforcement learning (RL), as RL algorithms are empirically observed to be unstable and sensitive to variations in training conditions. To formally address this issue, we study list replicability in the Probably Approximately Correct (PAC) RL framework, where an algorithm must return a near-optimal policy that lies in a small list of policies across different runs, with high probability. The size of this list defines the list complexity. We introduce both weak and strong forms of list replicability: the weak form ensures that the final learned policy belongs to a small list, while the strong form further requires that the entire sequence of executed policies remains constrained. These objectives are challenging, as existing RL algorithms exhibit exponential list complexity due to their instability. Our main theoretical…
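The definition of list complexity can be made concrete with a small experiment. The sketch below is not the paper's algorithm: `noisy_greedy_policy`, `empirical_list`, the toy value table `TRUE_VALUES`, and the tolerance `tol` are all hypothetical names and settings chosen for illustration. It runs a toy learner many times with independent randomness and counts how many distinct policies it outputs, first with a plain argmax and then with a lowest-index tie-break among near-best actions.

```python
import random
from collections import Counter

# Hypothetical toy problem: per-state action values, with exact ties in states 0 and 1.
TRUE_VALUES = [[1.0, 1.0], [0.5, 0.5], [0.2, 0.9]]

def noisy_greedy_policy(true_values, noise, rng, tol=0.0):
    """Toy learner (an assumption, not the paper's method): perturb each value
    with uniform noise, then pick the lowest-index action whose estimate is
    within `tol` of the best estimate in that state."""
    policy = []
    for state_values in true_values:
        estimates = [v + rng.uniform(-noise, noise) for v in state_values]
        best = max(estimates)
        policy.append(min(a for a, q in enumerate(estimates) if q >= best - tol))
    return tuple(policy)

def empirical_list(learner, n_runs, seed=0):
    """Run the learner with independent random generators and count the
    distinct policies it returns; the number of keys is the empirical list size."""
    counts = Counter()
    for run in range(n_runs):
        counts[learner(random.Random(seed + run))] += 1
    return counts

# Plain argmax: ties are broken by noise, so the output varies across runs.
unstable = empirical_list(lambda rng: noisy_greedy_policy(TRUE_VALUES, 0.05, rng), 200)
# Tolerance 0.2 exceeds the largest possible gap between tied estimates (0.1).
stable = empirical_list(lambda rng: noisy_greedy_policy(TRUE_VALUES, 0.05, rng, tol=0.2), 200)

print("distinct policies, plain argmax:", len(unstable))
print("distinct policies, tolerant tie-break:", len(stable))
```

In this toy setting the tolerant tie-break collapses the output to a single policy because every value gap is either far below or far above `tol`. A fixed tolerance is not a replicability guarantee in general, since a gap close to the threshold can still flip between runs.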

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

The paper is very well written and the novel techniques are clearly explained. This is to the best of my knowledge the first work to consider the question of list-replicable RL, and so contributes to our theoretical understanding of the feasibility of stable RL. The approach to stable planning is a nice contribution in that it is simple enough to be adapted to a variety of algorithms and empirically improves stability, a common problem for empirical RL.

Weaknesses

My concerns regarding the results are mostly about comparison to prior results in replicable RL. First, this paper omits reference to related work [1] that seems algorithmically similar. [1] improves upon prior work on replicable RL by giving more sample-efficient replicable algorithms in the tabular setting. Their results also rely on the idea of stably learning a collection of ignorable states, then doing backward induction with data collected from unignorable states to learn a good policy.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- **Quality:** The paper’s main claims are supported by rigorous and well-structured proofs, demonstrating a solid theoretical foundation. Additionally, the empirical results, while limited in scope, align well with the theoretical claims and demonstrate practical relevance.
- **Clarity:** The paper's central contribution—introducing list replicability as a performance criterion in RL alongside efficient algorithms—is both well-motivated and clearly presented. The exposition is supported by intu

Weaknesses

Rather than separating into broad quality, clarity, significance and originality categories, I will outline my main concerns in a more detailed manner below.
- Theorem 1.3 establishes a lower bound on list complexity of $\Omega(SAH)$ for weakly list-replicable RL algorithms, which is notably lower than the upper bounds achieved by Algorithms 2 and 3. This discrepancy raises the question of whether the proposed algorithms admit non-tight bounds that could be further improved. A discussion address

Reviewer 03 · Rating 6 · Confidence 4

Strengths

(+) This paper extends replicable RL to the MDP setup. The analysis and results are technically solid.

Weaknesses

(-) The numerical results cannot reflect the replicability of the algorithm.

Taxonomy

Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques