The Sample Complexity of Online Reinforcement Learning: A Multi-model Perspective
Michael Muehlebach, Zhiyu He, Michael I. Jordan

TL;DR
This paper analyzes the sample complexity of online reinforcement learning for nonlinear dynamical systems with continuous spaces, providing regret bounds for various model classes and highlighting practical algorithm features.
Contribution
It introduces a unified analysis of sample complexity for diverse nonlinear dynamical models, extending previous results to more general settings.
Findings
Policy regret of O(N ε^2 + d_u ln(m(ε))/ε^2) in general models
Regret of O(√(d_u N p)) for parametrized models like neural networks
Algorithms are simple, incorporate prior knowledge, and have benign transients
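A rough numerical sketch of how the two regret terms in the first finding trade off against each other. It rests on simplifying assumptions that are not taken from the paper: all constants hidden by the O-notation are set to 1, and ln(m(ε)) is treated as a fixed number `log_m` rather than a function of ε. The function names and the values of `N`, `d_u`, and `log_m` are hypothetical. Under these assumptions the two terms balance at ε = (d_u ln(m)/N)^(1/4), which gives a bound of 2√(N d_u ln(m)). That is the same square-root shape as the second finding if ln(m) grows on the order of the parameter count p.

```python
import math


def regret_bound(N, eps, d_u, log_m):
    """Evaluate N*eps^2 + d_u*log_m/eps^2 with all hidden constants set to 1."""
    return N * eps**2 + d_u * log_m / eps**2


def balancing_eps(N, d_u, log_m):
    """Width that equalizes the two terms when log_m is held fixed (an assumption)."""
    return (d_u * log_m / N) ** 0.25


# Hypothetical values, chosen only for illustration.
N, d_u, log_m = 10_000, 2, 50.0

eps_star = balancing_eps(N, d_u, log_m)
best = regret_bound(N, eps_star, d_u, log_m)

# Closed form of the balanced bound: 2 * sqrt(N * d_u * log_m).
closed_form = 2.0 * math.sqrt(N * d_u * log_m)

print(f"eps* = {eps_star:.4f}, bound = {best:.1f}, closed form = {closed_form:.1f}")
```

A smaller ε shrinks the discretization term N ε^2 but inflates the d_u ln(m(ε))/ε^2 term, so neither extreme is favorable. In the actual bound, m(ε) also grows as ε shrinks, which this sketch ignores.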
Abstract
We study the sample complexity of online reinforcement learning in the general non-episodic setting of nonlinear dynamical systems with continuous state and action spaces. Our analysis accommodates a large class of dynamical systems ranging from a finite set of nonlinear candidate models to models with bounded and Lipschitz continuous dynamics, to systems that are parametrized by a compact and real-valued set of parameters. In the most general setting, our algorithm achieves a policy regret of $O(N\varepsilon^2 + d_u \ln(m(\varepsilon))/\varepsilon^2)$, where $N$ is the time horizon, $\varepsilon$ is a user-specified discretization width, $d_u$ the input dimension, and $m(\varepsilon)$ measures the complexity of the function class under consideration via its packing number. In the special case where the dynamics are parametrized by a compact and real-valued…
Peer Reviews
Decision: ICLR 2026 Poster
The paper studies non-episodic RL for a class of nonlinear dynamical systems that is more general than lots of related work. This problem is clearly on the frontier of RL theory. The proposed techniques use a nice mixture of online learning theory, Bayesian methods, and nonlinear control theory. The paper should be of interest to researchers with both RL and control backgrounds. The main assumptions (besides ignoring computational cost) are Assumption 1, which is related to Bellman optimality a
In the Theorem 1 statement, it is a bit confusing to see the equation (1) suddenly called "$\mathcal{H}_2$ gain", maybe it is equivalent to the classic $\mathcal{H}_2$ gain for linear dynamics and $l(\cdot,\cdot)$ quadratic, but this version is still unfamiliar and RL audiences definitely won't know it. In general, the paper seems to assume a level of familiarity with classic control theory that the ICLR audience may not possess; it would improve the paper to do a bit more hand-holding. Recover
1. The discussion of related work is exceptionally clear, and the citations appear comprehensive. 2. The theoretical analysis is rigorous. 3. The paper is well organized, and the narrative progresses with a coherent, reader-friendly logic.
While I do not see any glaring flaws, the following points prevent a stronger recommendation: 1. Under the stated assumptions, the theoretical guarantees are not particularly surprising. Despite the authors’ thorough comparison with prior work, the contribution seems incremental relative to the papers referenced around line 67 of the manuscript. 2. The problem setting is rather restricted, and its practical value is uncertain. The paper provides only simple numerical examples in the appendix, le
1. The paper unifies three increasingly general control/learning regimes with one posterior-sampling–plus–Hedge template. It starts from a finite candidate set $\{f_1,\dots,f_m\}$, where the frequentist policy regret is of order $O((\ln N + \ln m)/\Delta)$, so the dependence on $m$ is logarithmic as in online learning. It then lifts this to an infinite/bounded function class by constructing an $\varepsilon$-packing and obtains regret of the form $N\varepsilon^2 + (\ln N + \ln m(\varepsilon))/\va
1. The analysis is essentially realizability-based: in all three settings (S1 with a finite set of candidates, S2 with an $\varepsilon$-packed class, and S3 with a parametrized family) the true dynamics $f$ is assumed to lie in the modeling class $F$. In S1, Theorem 2.1 yields a policy-regret bound of order $O((\ln N + \ln m)/\Delta)$, but this relies on a separation margin $\Delta>0$ between the candidate models so that suboptimal ones can be eliminated; when the models are nearly indistinguish
Taxonomy
Topics: Innovation Diffusion and Forecasting
Methods: Sparse Evolutionary Training
