The Sample Complexity of Online Reinforcement Learning: A Multi-model Perspective

Michael Muehlebach; Zhiyu He; Michael I. Jordan

arXiv:2501.15910·cs.LG·March 2, 2026

Open Access · 3 Reviews

TL;DR

This paper analyzes the sample complexity of online reinforcement learning for nonlinear dynamical systems with continuous spaces, providing regret bounds for various model classes and highlighting practical algorithm features.

Contribution

It introduces a unified analysis of sample complexity for diverse nonlinear dynamical models, extending previous results to more general settings.

Findings

01. Policy regret of $O(N ϵ^{2} + d_{u} \ln(m(ϵ)) / ϵ^{2})$ in general models

02. Regret of $O(\sqrt{d_{u} N p})$ for parametrized models like neural networks

03. Algorithms are simple, incorporate prior knowledge, and have benign transients

Abstract

We study the sample complexity of online reinforcement learning in the general non-episodic setting of nonlinear dynamical systems with continuous state and action spaces. Our analysis accommodates a large class of dynamical systems, ranging from a finite set of nonlinear candidate models, to models with bounded and Lipschitz continuous dynamics, to systems that are parametrized by a compact and real-valued set of parameters. In the most general setting, our algorithm achieves a policy regret of $O(N ϵ^{2} + d_{u} \ln(m(ϵ)) / ϵ^{2})$, where $N$ is the time horizon, $ϵ$ is a user-specified discretization width, $d_{u}$ is the input dimension, and $m(ϵ)$ measures the complexity of the function class under consideration via its packing number. In the special case where the dynamics are parametrized by a compact and real-valued…
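The bound above trades a term that grows with $ϵ$ against one that shrinks with it, so the discretization width has to be tuned to the horizon. The following is a minimal numerical sketch of that trade-off, not taken from the paper: it drops the constants hidden by $O(\cdot)$, and the function names, the search grid, and the complexity model `log_packing_box` (a packing number that scales like that of a `dim`-dimensional ball) are all assumptions made for illustration.

```python
import math


def regret_bound(eps, horizon, d_u, log_packing):
    """Evaluate N*eps^2 + d_u*ln(m(eps))/eps^2, ignoring constants hidden by O(.)."""
    return horizon * eps ** 2 + d_u * log_packing(eps) / eps ** 2


def log_packing_box(eps, dim=3, radius=1.0):
    """Hypothetical complexity model: ln m(eps) = dim * ln(1 + 2*radius/eps)."""
    return dim * math.log(1.0 + 2.0 * radius / eps)


def best_eps(horizon, d_u, log_packing, grid_size=2000):
    """Grid search for the eps in [1e-3, 1] that minimizes the bound."""
    grid = [10.0 ** (-3.0 + 3.0 * i / (grid_size - 1)) for i in range(grid_size)]
    return min(grid, key=lambda eps: regret_bound(eps, horizon, d_u, log_packing))


if __name__ == "__main__":
    for horizon in (10 ** 4, 10 ** 6):
        eps = best_eps(horizon, d_u=2, log_packing=log_packing_box)
        print(horizon, eps, regret_bound(eps, horizon, 2, log_packing_box))
```

If $m$ did not depend on $ϵ$, setting the derivative to zero would give $ϵ^{4} = d_{u} \ln(m) / N$ and a bound of $2\sqrt{N d_{u} \ln(m)}$; the grid search handles the case where $m(ϵ)$ grows as $ϵ$ shrinks.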

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

The paper studies non-episodic RL for a class of nonlinear dynamical systems that is more general than lots of related work. This problem is clearly on the frontier of RL theory. The proposed techniques use a nice mixture of online learning theory, Bayesian methods, and nonlinear control theory. The paper should be of interest to researchers with both RL and control backgrounds. The main assumptions (besides ignoring computational cost) are Assumption 1, which is related to Bellman optimality a

Weaknesses

In the Theorem 1 statement, it is a bit confusing to see equation (1) suddenly called the "$\mathcal{H}_2$ gain". It may be equivalent to the classic $\mathcal{H}_2$ gain for linear dynamics and quadratic $l(\cdot,\cdot)$, but this version is still unfamiliar, and RL audiences definitely won't know it. In general, the paper seems to assume a level of familiarity with classic control theory that the ICLR audience may not possess; it would improve the paper to do a bit more hand-holding. Recover

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The discussion of related work is exceptionally clear, and the citations appear comprehensive.
2. The theoretical analysis is rigorous.
3. The paper is well organized, and the narrative progresses with a coherent, reader-friendly logic.

Weaknesses

While I do not see any glaring flaws, the following points prevent a stronger recommendation:
1. Under the stated assumptions, the theoretical guarantees are not particularly surprising. Despite the authors’ thorough comparison with prior work, the contribution seems incremental relative to the papers referenced around line 67 of the manuscript.
2. The problem setting is rather restricted, and its practical value is uncertain. The paper provides only simple numerical examples in the appendix, le

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The paper unifies three increasingly general control/learning regimes with one posterior-sampling–plus–Hedge template. It starts from a finite candidate set $\{f_1,\dots,f_m\}$, where the frequentist policy regret is of order $O((\ln N + \ln m)/\Delta)$, so the dependence on $m$ is logarithmic as in online learning. It then lifts this to an infinite/bounded function class by constructing an $\varepsilon$-packing and obtains regret of the form $N\varepsilon^2 + (\ln N + \ln m(\varepsilon))/\va
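For readers unfamiliar with the posterior-sampling-plus-Hedge idea mentioned in this review, the following is a generic sketch of exponential weights over a finite candidate set, with one candidate sampled per step to choose the control input. It is not the paper's algorithm: the scalar system $x^{+} = a \sin(x) + u + \text{noise}$, the candidate gains, the controller, and the step size `eta` are all hypothetical choices made for illustration. With `eta` set to $1/(2σ^{2})$ and squared prediction errors as losses, the weights coincide with a Bayesian posterior under Gaussian noise of standard deviation $σ$ and a uniform prior.

```python
import math
import random


def hedge_update(log_weights, losses, eta):
    """One exponential-weights step in log space: w_i <- w_i * exp(-eta * loss_i)."""
    return [lw - eta * loss for lw, loss in zip(log_weights, losses)]


def normalize(log_weights):
    """Turn log-weights into probabilities, shifting by the max for stability."""
    top = max(log_weights)
    unnormalized = [math.exp(lw - top) for lw in log_weights]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def run_hedge_sampling(gains, true_gain, steps=1000, noise_std=0.1, seed=0):
    """Toy loop on x+ = a*sin(x) + u + noise with candidate gains a_i (all hypothetical)."""
    rng = random.Random(seed)
    eta = 1.0 / (2.0 * noise_std ** 2)  # makes the weights a Gaussian posterior
    log_w = [0.0] * len(gains)          # uniform prior over the candidates
    x = 1.0
    for _ in range(steps):
        probs = normalize(log_w)
        # Sample one candidate model according to the current weights.
        k = rng.choices(range(len(gains)), weights=probs)[0]
        # Controller designed for the sampled model: cancel its nonlinearity.
        u = -gains[k] * math.sin(x) + 0.5 * x
        x_next = true_gain * math.sin(x) + u + rng.gauss(0.0, noise_std)
        # Loss of each candidate: squared one-step prediction error.
        losses = [(a * math.sin(x) + u - x_next) ** 2 for a in gains]
        log_w = hedge_update(log_w, losses, eta)
        x = x_next
    return normalize(log_w)


if __name__ == "__main__":
    print(run_hedge_sampling([0.5, 1.0, 1.5], true_gain=1.0))
```

In this toy setup the process noise keeps the state excited, so the candidates remain distinguishable; without such a separation between models the weights would concentrate much more slowly.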

Weaknesses

1. The analysis is essentially realizability-based: in all three settings (S1 with a finite set of candidates, S2 with an $\varepsilon$-packed class, and S3 with a parametrized family) the true dynamics $f$ is assumed to lie in the modeling class $F$. In S1, Theorem 2.1 yields a policy-regret bound of order $O((\ln N + \ln m)/\Delta)$, but this relies on a separation margin $\Delta>0$ between the candidate models so that suboptimal ones can be eliminated; when the models are nearly indistinguish

Taxonomy

Topics: Innovation Diffusion and Forecasting

Methods: Sparse Evolutionary Training