Sample-efficient and Scalable Exploration in Continuous-Time RL

Klemens Iten; Lenart Treven; Bhavya Sukhija; Florian Dörfler; Andreas Krause

arXiv:2510.24482·cs.LG·March 3, 2026

TL;DR

This paper introduces COMBRL, a scalable and sample-efficient continuous-time reinforcement learning algorithm that uses probabilistic models to learn unknown nonlinear dynamics. It comes with a sublinear regret bound in the reward-driven setting and a sample complexity bound in the unsupervised setting.

Contribution

We propose COMBRL, a novel continuous-time RL algorithm that combines probabilistic modeling with uncertainty-aware exploration, providing theoretical regret bounds and improved empirical results.

Findings

01

COMBRL achieves sublinear regret in reward-driven settings.

02

The algorithm demonstrates superior sample efficiency over prior methods.

03

It outperforms baselines across various deep RL tasks.

Abstract

Reinforcement learning algorithms are typically designed for discrete-time dynamics, even though the underlying real-world control systems are often continuous in time. In this paper, we study the problem of continuous-time reinforcement learning, where the unknown system dynamics are represented using nonlinear ordinary differential equations (ODEs). We leverage probabilistic models, such as Gaussian processes and Bayesian neural networks, to learn an uncertainty-aware model of the underlying ODE. Our algorithm, COMBRL, greedily maximizes a weighted sum of the extrinsic reward and model epistemic uncertainty. This yields a scalable and sample-efficient approach to continuous-time model-based RL. We show that COMBRL achieves sublinear regret in the reward-driven setting, and in the unsupervised RL setting (i.e., without extrinsic rewards), we provide a sample complexity bound. In our…
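The objective described in the abstract, a weighted sum of extrinsic reward and model epistemic uncertainty, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a small ensemble of toy dynamics models as a stand-in for the Gaussian process or Bayesian neural network, measures epistemic uncertainty as disagreement between ensemble members, and integrates the mean model with explicit Euler steps. All names (`ensemble_dynamics`, `epistemic_uncertainty`, `combined_objective`, `lam`) are hypothetical.

```python
import numpy as np


def ensemble_dynamics(x, u, weights):
    """Toy ensemble of ODE right-hand sides (an assumption, not the paper's model).

    weights has shape (n_members, dim_x, dim_x + dim_u); each member
    predicts dx/dt for the state x and control u.
    """
    z = np.concatenate([x, u])
    return np.tanh(weights @ z)  # shape (n_members, dim_x)


def epistemic_uncertainty(preds):
    """Disagreement between members: norm of the per-dimension std."""
    return float(np.linalg.norm(preds.std(axis=0)))


def combined_objective(x0, controls, weights, reward_fn, lam, dt):
    """Accumulate reward + lam * uncertainty along an Euler rollout of the mean model."""
    x = np.asarray(x0, dtype=float)
    total = 0.0
    for u in controls:
        preds = ensemble_dynamics(x, u, weights)
        total += dt * (reward_fn(x, u) + lam * epistemic_uncertainty(preds))
        x = x + dt * preds.mean(axis=0)  # explicit Euler step
    return total


# Usage: score one candidate control sequence with and without the bonus.
rng = np.random.default_rng(0)
weights = rng.normal(size=(5, 2, 3))    # 5 members, 2-D state, 1-D control
controls = [np.array([0.1])] * 10       # constant control, 10 steps


def reward_fn(x, u):
    # Hypothetical quadratic cost written as a reward.
    return -float(x @ x) - 0.1 * float(u @ u)


j_reward_only = combined_objective(np.ones(2), controls, weights, reward_fn, lam=0.0, dt=0.05)
j_with_bonus = combined_objective(np.ones(2), controls, weights, reward_fn, lam=1.0, dt=0.05)
```

In this sketch, setting `lam` to zero recovers a purely reward-driven planner on the mean model, while dropping the reward term leaves only the uncertainty bonus, which loosely mirrors the unsupervised setting mentioned in the abstract. A planner would evaluate `combined_objective` for many candidate control sequences and pick the best one.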

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 2

Strengths

- Continuous-time model-based RL is an under-explored area with practical relevance, e.g., in robotic control.
- The paper includes extensive experiments evaluating design choices and hyperparameter sensitivity across multiple benchmarks.
- Theoretical guarantees are provided.

Weaknesses

- In Figure 3, COMBRL’s performance on the primary task is modest: in 5 out of 7 tasks, its error bars overlap substantially with those of the Mean Planning baseline, raising questions about practical gains.
- The paper assumes familiarity with several methods (e.g., SAC, iCEM, and TaCoS) without brief introductions, which hinders accessibility.
- The novelty is unclear. While continuous-time RL is valuable, the core idea, balancing reward maximization and epistemic uncertainty in the objective

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The theoretical studies are solid.

Weaknesses

The primary weakness is the numerical evaluation. When ground-truth values are available, experiments should report coverage rates of the true value; limiting results to the averaged returns and standard deviations does not provide an adequate assessment.

Reviewer 03 · Rating 6 · Confidence 2

Strengths

- I am not an expert in this area, but I agree that the study of continuous-time RL is promising. In addition, this work explores the unsupervised setting, which is important and interesting.
- The reward-plus-uncertainty objective makes sense and the results look good.

Weaknesses

1. My main concern is that continuous-time RL has already been studied in previous work such as [1], in which the system is also modelled by ODEs and the model is built on GPs. This makes the novelty incremental.
2. In Figure 2, the only baseline approaches are the mean planner and PETS. It would be better to provide more comparisons with other state-of-the-art methods on these Gym and DMC tasks.

[1] Efficient exploration in continuous-time model-based reinforcement learning, NeurIPS 2023.
