SOReL and TOReL: Two Methods for Fully Offline Reinforcement Learning
Mattie Fellows, Clarisse Wibault, Uljad Berdica, Johannes Forkel, Michael A. Osborne, Jakob N. Foerster

TL;DR
This paper introduces SOReL and TOReL, two offline reinforcement learning algorithms that improve safety, reliability, and hyperparameter tuning without online interactions, advancing RL's real-world applicability.
Contribution
The paper presents novel Bayesian offline RL algorithms, SOReL for safe performance estimation and TOReL for offline hyperparameter tuning, reducing reliance on online data.
Findings
SOReL accurately estimates regret using offline data.
TOReL achieves competitive hyperparameter tuning performance offline.
Both methods enhance safety and reliability in offline RL applications.
Abstract
Sample efficiency remains a major obstacle for real world adoption of reinforcement learning (RL): success has been limited to settings where simulators provide access to essentially unlimited environment interactions, which in reality are typically costly or dangerous to obtain. Offline RL in principle offers a solution by exploiting offline data to learn a near-optimal policy before deployment. In practice, however, current offline RL methods rely on extensive online interactions for hyperparameter tuning, and have no reliable bound on their initial online performance. To address these two issues, we introduce two algorithms. Firstly, SOReL: an algorithm for safe offline reinforcement learning. Using only offline data, our Bayesian approach infers a posterior over environment dynamics to obtain a reliable estimate of the online performance via the posterior predictive uncertainty.…
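As a rough illustration of the idea in the abstract (infer a posterior over dynamics, then read off the spread of a policy's predicted performance), here is a minimal sketch. It is not the authors' implementation. It assumes a small tabular MDP with known rewards, a Dirichlet posterior over each transition row built from offline visit counts, and regret defined as the gap between the optimal value and the policy's value at a start state. The function names (`policy_value`, `optimal_value`, `sample_regrets`) and all parameters are hypothetical.

```python
import numpy as np

def policy_value(P, R, pi, gamma):
    """Exact value of a deterministic policy pi in a tabular MDP.
    P has shape [S, A, S], R has shape [S, A], pi has shape [S]."""
    S = P.shape[0]
    idx = np.arange(S)
    P_pi = P[idx, pi]  # [S, S] transitions under pi
    R_pi = R[idx, pi]  # [S] rewards under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

def optimal_value(P, R, gamma, iters=500):
    """Optimal value function via plain value iteration."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = (R + gamma * (P @ V)).max(axis=1)
    return V

def sample_regrets(counts, R, pi, gamma=0.9, s0=0, n_samples=200, seed=0):
    """Regret of pi at s0 under dynamics drawn from the posterior.
    Assumption: Dirichlet(1 + counts) posterior per (state, action) row."""
    rng = np.random.default_rng(seed)
    regrets = np.empty(n_samples)
    for k in range(n_samples):
        # Normalised gamma draws are a Dirichlet sample for every row at once.
        G = rng.gamma(1.0 + counts)
        P = G / G.sum(axis=-1, keepdims=True)
        regrets[k] = optimal_value(P, R, gamma)[s0] - policy_value(P, R, pi, gamma)[s0]
    return regrets

# Toy usage with made-up offline counts (4 states, 2 actions).
rng = np.random.default_rng(1)
counts = rng.integers(0, 20, size=(4, 2, 4)).astype(float)
R = rng.uniform(size=(4, 2))
pi = np.zeros(4, dtype=int)  # policy that always takes action 0
regrets = sample_regrets(counts, R, pi)
print("median regret:", np.median(regrets))
print("95% quantile:", np.quantile(regrets, 0.95))
```

The median of the sampled regrets gives a point estimate, and the upper quantile gives a more cautious figure. With more offline data the posterior concentrates and the two move closer together.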
Peer Reviews
Decision: Submitted to ICLR 2026
- The paper addresses important problems that plague the design of offline RL algorithms, and these issues limit the applicability of offline RL in general.
- The empirical results appear pretty compelling as well.
- The paper could do a better job of presenting connections with lower bounds in offline RL, relating regret estimation to the hardness of policy evaluation; for instance, see [1], though there are probably other works that build on it. Why does the proposed procedure actually work in light of some of these stark negative results?
[1] Wang et al., "What are the Statistical Limits of Offline RL with Linear Function Approximation?", 2020.
**Well-motivated problem formulation.** The paper identifies and articulates two concrete, practically important issues in offline RL that have been underexplored: the reliance on online interactions for hyperparameter tuning and the lack of performance guarantees before deployment. The motivating examples (healthcare, robotics) and the cycle diagram (Figure 1) effectively communicate why these issues matter for real-world applications. This clear problem framing strengthens the contribution's s
**Gap between theory and practice.** The theoretical regret bound (Theorem 1) is noted to be "too conservative" and is not used in practice; instead, the posterior predictive median (Eq. 5) is employed as a heuristic. This disconnect undermines the theoretical contribution's direct utility. The paper acknowledges that the bound's tightness depends critically on model accuracy relative to discount factor \gamma, a constraint difficult to satisfy in practice (e.g., only pendulum-v1 yields non-tri
- The theoretical considerations and derivation of the algorithms, leveraging PIL to select hyperparameters for the model, appear novel and significant.
- The method appears to approximate the true regret well, enabling fully offline hyperparameter tuning and thus bringing offline RL closer to real-world applicability.
- The method is very flexible and can in principle be used with any existing offline RL algorithm of the user's choice.
- The authors demonstrate accurate regret estimation on a
While I find the proposed method highly appealing in many ways, I find the paper has a couple of major weaknesses: 1) A key contribution appears to be the PIL-based tuning of the model hyperparameters. I would argue, however, that this is the "easy" part of offline RL hyperparameter tuning, since we can resort to simple supervised learning techniques like holding out 10% of the training data and measuring prediction performance on this set to select hyperparameters. I would argue the proposed meth
Taxonomy
Topics: Elevator Systems and Control · Reinforcement Learning in Robotics
