Statistical Guarantees for Offline Domain Randomization
Arnaud Fickinger, Abderrahim Bendahi, Stuart Russell

TL;DR
This paper provides the first statistical guarantees for offline domain randomization in reinforcement learning, showing conditions under which the method reliably estimates simulator parameters from offline data.
Contribution
It formulates offline domain randomization as a maximum-likelihood estimation problem and proves its consistency under mild assumptions, establishing a theoretical foundation for the approach.
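As a rough sketch of what such an objective can look like (the notation here is assumed for illustration and is not taken from the paper): given an offline dataset $\mathcal{D} = \{(s_i, a_i, s'_i)\}_{i=1}^N$, a simulator family with transition densities $P_\xi(s' \mid s, a)$, and a parametric distribution $p_\phi$ over simulator parameters $\xi$, a maximum-likelihood fit would take the form:

```latex
\hat{\phi}_N \in \arg\max_{\phi \in \Phi} \; \frac{1}{N} \sum_{i=1}^{N}
  \log \mathbb{E}_{\xi \sim p_\phi}\!\left[ P_\xi(s'_i \mid s_i, a_i) \right]
```

Consistency then means that $\hat{\phi}_N$ approaches the parameters describing the true dynamics as $N \to \infty$, either in probability (weak) or almost surely (strong).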
Findings
The estimator converges in probability to the true dynamics as the offline dataset grows (weak consistency).
Under Lipschitz continuity assumptions, the estimator converges almost surely (strong consistency).
Outlines relaxations of assumptions to broaden applicability.
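The consistency claims above can be illustrated numerically on a toy problem. The sketch below is not the paper's setup: it assumes a hypothetical 1-D system whose per-transition displacement is `xi * a + noise`, with `xi` drawn from a Gaussian and a known noise scale, so that the marginal likelihood is available in closed form. All function names are made up for illustration, and a plain grid search stands in for whatever optimizer one would use in practice.

```python
import numpy as np


def sample_offline_data(n, mu_true, sigma_true, noise_std, rng):
    """Draw n i.i.d. toy transitions, returned as (action, displacement) arrays.

    Assumption (hypothetical model): each transition uses a fresh parameter
    xi ~ N(mu_true, sigma_true^2) and moves by delta = xi * a + eps,
    with eps ~ N(0, noise_std^2).
    """
    a = rng.uniform(0.5, 2.0, size=n)
    xi = rng.normal(mu_true, sigma_true, size=n)
    delta = xi * a + rng.normal(0.0, noise_std, size=n)
    return a, delta


def neg_log_likelihood(mu, sigma, a, delta, noise_std):
    """Average negative log-likelihood with xi integrated out.

    Under the toy model, delta | a ~ N(mu * a, sigma^2 * a^2 + noise_std^2).
    """
    var = (sigma * a) ** 2 + noise_std ** 2
    return 0.5 * np.mean(np.log(2.0 * np.pi * var) + (delta - mu * a) ** 2 / var)


def fit_odr_mle(a, delta, noise_std, mu_grid, sigma_grid):
    """Grid-search maximum-likelihood estimate of (mu, sigma)."""
    best = min(
        (neg_log_likelihood(m, s, a, delta, noise_std), m, s)
        for m in mu_grid
        for s in sigma_grid
    )
    return best[1], best[2]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu_grid = np.linspace(0.0, 2.0, 81)
    sigma_grid = np.linspace(0.05, 1.0, 39)
    for n in (100, 1000, 10000):
        a, delta = sample_offline_data(n, 1.0, 0.3, 0.1, rng)
        mu_hat, sigma_hat = fit_odr_mle(a, delta, 0.1, mu_grid, sigma_grid)
        print(n, mu_hat, sigma_hat)
```

On this toy model, rerunning the fit with larger `n` should bring the estimates closer to `(mu_true, sigma_true)`, up to grid resolution, which is the behavior that weak consistency describes.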
Abstract
Reinforcement-learning (RL) agents often struggle when deployed from simulation to the real world. A dominant strategy for reducing the sim-to-real gap is domain randomization (DR), which trains the policy across many simulators produced by sampling dynamics parameters, but standard DR ignores offline data already available from the real system. We study offline domain randomization (ODR), which first fits a distribution over simulator parameters to an offline dataset. While a growing body of empirical work reports substantial gains with algorithms such as DROPO, the theoretical foundations of ODR remain largely unexplored. In this work, we cast ODR as a maximum-likelihood estimation over a parametric simulator family and provide statistical guarantees: under mild regularity and identifiability conditions, the estimator is weakly consistent (it converges in probability to the true…
Peer Reviews
Decision · ICLR 2026 Poster
- The paper tackles a relevant framework in sim-to-real transfer, ODR, which has gained traction in recent years but lacked a thorough theoretical understanding. Such a setting opens up applications of domain randomization that are arguably safer and more sample-efficient than uniform domain randomization methods.
- Restrictive Gaussian assumption: recent empirical works further extend the ODR framework by considering normalizing flows or neural density estimators over dynamics parameters [1]. It is unclear how much the presented analysis is restricted to (1) simulators that follow a Gaussian parameter distribution and (2) transition functions that are also assumed to be Gaussian. [1] Muratore, Fabio, et al. "Neural posterior domain randomization." Conference on Robot Learning. PMLR, 2022.
1. By framing ODR as a maximum-likelihood estimation (MLE) problem, the paper elevates it from a purely empirical heuristic to a method with formal statistical grounding, establishing properties such as consistency.
2. The paper provides a clear exposition of its underlying assumptions, such as i.i.d. sampling, mixture positivity, and Lipschitz continuity, and thoughtfully discusses possible relaxations, which helps clarify the scope and applicability of the theoretical results.
1. The theoretical framework assumes that the true environment dynamics $M^*$ lie within a known parameterized simulator family $\{M_\xi\}$, and that a representative dataset of real-world transitions is available. In practice, however, the true parameterization is unknown, and it is rarely possible to guarantee that the simulator family adequately captures real-world behavior. This makes the theory elegant but largely non-operational in realistic settings.
2. The proposed ODR framework relies on
The paper represents one of the first attempts to formally establish statistical guarantees for ODR, an area that has previously relied primarily on empirical evidence (e.g., algorithms such as DROPO). The theoretical treatment is rigorous yet well-motivated, and the authors are careful to analyze the realism of their assumptions, providing insightful discussions on how they could be relaxed to cover broader scenarios. This combination of solid mathematical grounding and practical reflection sig
The analysis assumes that environment parameters are predefined, but in practice, it may be more realistic to start with a broader set of perturbable parameters and iteratively remove those with small variance as data accumulates. It would be helpful to discuss whether the current proofs would still hold, or require modification, under such an adaptive parameter-selection procedure. While the theoretical contribution stands well on its own, the paper could be strengthened by adding a few illus
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Advanced Multi-Objective Optimization Algorithms
