Policy-Driven World Model Adaptation for Robust Offline Model-based Reinforcement Learning
Jiayu Chen, Le Xu, Aravind Venugopal, Jeff Schneider

TL;DR
This paper introduces a policy-driven adaptation framework for offline model-based reinforcement learning that enhances robustness and performance by dynamically optimizing the world model and policy together using a maximin approach.
Contribution
It proposes a unified learning framework that casts robust offline MBRL as a maximin optimization problem, addressing the objective mismatch and lack of robustness in prior two-stage methods.
Findings
Achieves state-of-the-art results on noisy MuJoCo tasks
Demonstrates improved robustness against adversarial noise
Provides theoretical analysis supporting the method's effectiveness
Abstract
Offline reinforcement learning (RL) offers a powerful paradigm for data-driven control. Compared to model-free approaches, offline model-based RL (MBRL) explicitly learns a world model from a static dataset and uses it as a surrogate simulator, improving data efficiency and enabling potential generalization beyond the dataset support. However, most existing offline MBRL methods follow a two-stage training procedure: first learning a world model by maximizing the likelihood of the observed transitions, then optimizing a policy to maximize its expected return under the learned model. This objective mismatch results in a world model that is not necessarily optimized for effective policy learning. Moreover, we observe that policies learned via offline MBRL often lack robustness during deployment, and small adversarial noise in the environment can lead to significant performance degradation.…
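To make the contrast between the two-stage procedure and a maximin formulation concrete, here is a minimal toy sketch. It is not the paper's algorithm: it assumes a small finite set of candidate policies and candidate world models, with a hypothetical table `RETURNS` whose entry `[i][j]` is the return of policy `i` under model `j`. The function names and the numbers are illustrative only.

```python
import numpy as np

def maximin_policy(returns):
    """Pick the policy with the best worst-case return over all candidate models.

    returns[i, j] is the (hypothetical) return of policy i under model j.
    """
    returns = np.asarray(returns, dtype=float)
    worst_case = returns.min(axis=1)   # inner min over candidate models
    best = int(worst_case.argmax())    # outer max over candidate policies
    return best, float(worst_case[best])

def two_stage_policy(returns, fitted_model):
    """Two-stage baseline: fix one fitted model, then pick the best policy under it."""
    returns = np.asarray(returns, dtype=float)
    best = int(returns[:, fitted_model].argmax())
    return best, float(returns[best, fitted_model])

# Illustrative numbers only (assumption, not from the paper).
RETURNS = [[10.0, 1.0, 2.0],
           [6.0, 5.0, 5.5],
           [7.0, 3.0, 4.0]]

print(maximin_policy(RETURNS))       # policy chosen for worst-case return
print(two_stage_policy(RETURNS, 0))  # policy chosen under model 0 only
```

In this toy table the two-stage choice looks best under the single fitted model but has the lowest return under the other candidates, which is the kind of fragility a maximin objective is meant to guard against.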
Peer Reviews
Decision · Submitted to ICLR 2026
- ROMBRL formulates policy and world model adaptation as a single maximin optimization problem, enabling joint learning and improved robustness compared to traditional two-stage approaches.
- Extensive experiments demonstrate that ROMBRL achieves state-of-the-art performance and stability, even when observations are corrupted by Gaussian noise, outperforming existing baselines on noisy MuJoCo and Tokamak tasks.
- The approach not only improves average performance but also reduces variance and fa
Empirically, the key advantage of the proposed method is that it is more robust than other approaches when faced with noisy observations, which is an important factor for deployment in real-world scenarios, since sensor noise will always play a role. However, if I am not mistaken, the authors compare their algorithm only to baselines that were not specifically developed for, or originally evaluated on, this use case. To really show SoTA performance when faced with noisy /
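The noisy-observation setting discussed in these reviews can be sketched as additive Gaussian noise applied to each observation before the policy sees it. This is a minimal sketch under assumptions: the helper name `corrupt_observation`, the zero-mean isotropic noise model, and the `noise_std` parameter are illustrative and not taken from the paper's evaluation protocol.

```python
import numpy as np

def corrupt_observation(obs, noise_std, rng):
    """Return obs plus zero-mean Gaussian noise with standard deviation noise_std.

    Assumption: noise is independent per dimension and per time step.
    """
    obs = np.asarray(obs, dtype=float)
    return obs + noise_std * rng.standard_normal(obs.shape)

rng = np.random.default_rng(0)
clean = np.array([0.5, -1.0, 2.0])
noisy = corrupt_observation(clean, noise_std=0.1, rng=rng)
print(noisy.shape)  # same shape as the clean observation
```

A robustness comparison would then feed `noisy` rather than `clean` to every method being evaluated, using the same noise level for all of them.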
1. The method's robustness is theoretically grounded in a constrained maximin objective and Stackelberg game dynamics. This approach directly yields formal bounds (Theorems 1–3) on the policy's suboptimality gap and the necessary model uncertainty range.
2. The introduction of Fisher Information Matrix approximations for the second-order terms, coupled with the use of the Woodbury matrix identity to efficiently compute matrix inverses, makes the second-order gradient computatio
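The Woodbury trick mentioned in this review can be illustrated with a short sketch. It assumes a damped empirical Fisher matrix of the form `damping * I + (1/n) * G.T @ G`, where `G` stacks `n` per-sample gradients of dimension `d`; that form and the function name are assumptions for illustration, since the excerpt does not give the paper's exact matrices. The identity replaces a `d x d` solve with an `n x n` solve, which is cheaper when `n` is much smaller than `d`.

```python
import numpy as np

def fisher_inverse_vector_product(grads, vec, damping):
    """Compute (damping * I + G^T G / n)^{-1} vec via the Woodbury identity.

    grads: array of shape (n, d), one gradient per row (assumed form).
    vec:   array of shape (d,).
    Only an n x n linear system is solved.
    """
    grads = np.asarray(grads, dtype=float)
    vec = np.asarray(vec, dtype=float)
    n = grads.shape[0]
    small = n * damping * np.eye(n) + grads @ grads.T           # (n, n)
    correction = grads.T @ np.linalg.solve(small, grads @ vec)  # (d,)
    return (vec - correction) / damping

# Compare against the direct d x d solve on random data.
rng = np.random.default_rng(0)
G = rng.standard_normal((3, 8))
v = rng.standard_normal(8)
direct = np.linalg.solve(0.1 * np.eye(8) + G.T @ G / 3, v)
print(np.allclose(fisher_inverse_vector_product(G, v, 0.1), direct))
```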
1. Although the authors compare ROMBRL with several state-of-the-art methods such as EDAC, MOBILE, and other baselines, none of these algorithms are explicitly designed for noisy or perturbed environments. If the main claim is that ROMBRL demonstrates superior robustness under noisy conditions, it would be important to include comparisons with existing robust offline RL methods, such as RORL [1]. RORL explicitly addresses robustness to observation perturbations by employing a simple yet effectiv
The main contribution lies in the idea of adapting the world model jointly with the policy via Stackelberg learning dynamics. This formulation is elegant and potentially generalizable. The theoretical discussion provides a solid foundation for the algorithmic design, and the experimental results demonstrate consistent performance gains over standard baselines, suggesting that the approach improves robustness without severely compromising sample efficiency. The inclusion of both MuJoCo and Tokama
While the Stackelberg game formulation is compelling, it would be valuable to include an ablation study comparing it directly with a simpler min-max optimization setup, where the world model acts adversarially to reduce the policy objective J. Such an analysis could help clarify the practical advantages of adopting the Stackelberg dynamics. The definition of the uncertainty set via the constraint $\mathrm{KL}(P_{\bar\phi}(\cdot|s,a) \,\|\, P_\phi^k(\cdot|s,a)) \le \epsilon$ appears to rely on a manually chosen threshol
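The KL constraint in this review can be checked in closed form when both world models predict diagonal Gaussian next-state distributions. That Gaussian parameterization is an assumption made here for illustration (it is common in offline MBRL but is not stated in the excerpt), and the function names and the example `epsilon` values are hypothetical.

```python
import numpy as np

def diag_gaussian_kl(mean_p, std_p, mean_q, std_q):
    """KL(P || Q) for diagonal Gaussians P = N(mean_p, std_p^2), Q = N(mean_q, std_q^2)."""
    mean_p, std_p = np.asarray(mean_p, dtype=float), np.asarray(std_p, dtype=float)
    mean_q, std_q = np.asarray(mean_q, dtype=float), np.asarray(std_q, dtype=float)
    per_dim = (np.log(std_q / std_p)
               + (std_p ** 2 + (mean_p - mean_q) ** 2) / (2.0 * std_q ** 2)
               - 0.5)
    return float(per_dim.sum())

def within_kl_ball(mean_p, std_p, mean_q, std_q, epsilon):
    """True if the first model's prediction at one (s, a) satisfies KL(P || Q) <= epsilon."""
    return diag_gaussian_kl(mean_p, std_p, mean_q, std_q) <= epsilon

# One-dimensional example: unit-variance Gaussians whose means differ by 1.
print(diag_gaussian_kl([0.0], [1.0], [1.0], [1.0]))             # KL = 0.5
print(within_kl_ball([0.0], [1.0], [1.0], [1.0], epsilon=0.1))  # False
```

Because the KL value scales with the squared mean shift relative to the predicted variance, a fixed `epsilon` admits very different model deviations across state-action pairs, which is one reason the choice of threshold matters.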
Taxonomy
Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Adversarial Robustness in Machine Learning
