Policy-Driven World Model Adaptation for Robust Offline Model-based Reinforcement Learning

Jiayu Chen; Le Xu; Aravind Venugopal; Jeff Schneider

arXiv:2505.13709·cs.LG·February 2, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a policy-driven adaptation framework for offline model-based reinforcement learning that enhances robustness and performance by dynamically optimizing the world model and policy together using a maximin approach.

Contribution

It proposes a unified learning framework that casts robust offline MBRL as a maximin optimization problem, addressing the objective mismatch and lack of robustness of prior two-stage methods.

Findings

1. Achieves state-of-the-art results on noisy MuJoCo tasks
2. Demonstrates improved robustness against adversarial noise
3. Provides theoretical analysis supporting the method's effectiveness

Abstract

Offline reinforcement learning (RL) offers a powerful paradigm for data-driven control. Compared to model-free approaches, offline model-based RL (MBRL) explicitly learns a world model from a static dataset and uses it as a surrogate simulator, improving data efficiency and enabling potential generalization beyond the dataset support. However, most existing offline MBRL methods follow a two-stage training procedure: first learning a world model by maximizing the likelihood of the observed transitions, then optimizing a policy to maximize its expected return under the learned model. This objective mismatch results in a world model that is not necessarily optimized for effective policy learning. Moreover, we observe that policies learned via offline MBRL often lack robustness during deployment, and small adversarial noise in the environment can lead to significant performance degradation.…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- ROMBRL formulates policy and world model adaptation as a single maximin optimization problem, enabling joint learning and improved robustness compared to traditional two-stage approaches.
- Extensive experiments demonstrate that ROMBRL achieves state-of-the-art performance and stability, even when observations are corrupted by Gaussian noise, outperforming existing baselines on noisy MuJoCo and Tokamak tasks.
- The approach not only improves average performance but also reduces variance and fa

Weaknesses

Empirically, the key advantage of the proposed method is that it is more robust than other approaches when faced with noisy observations, which is an important factor for deployment in real-world scenarios, since sensor noise will always play a role. However, if I am not mistaken, the authors compare their algorithm only to baselines that were neither specifically developed for nor originally evaluated on this use case. To really show SoTA performance when faced with noisy /

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The treatment of robustness is theoretically rigorous, employing a constrained maximin objective and Stackelberg game dynamics. This approach directly yields formal bounds (Theorems 1–3) on the policy's suboptimality gap and the necessary model uncertainty range.
2. The introduction of Fisher Information Matrix approximations for the second-order terms, coupled with the use of the Woodbury matrix identity to compute matrix inverses efficiently, makes the second-order gradient computatio
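As an illustration of the Woodbury matrix identity named in the point above, here is a minimal NumPy sketch. It is not the paper's implementation. It assumes a damped, low-rank Fisher-style matrix F = damping * I + G^T G / n built from n hypothetical per-sample gradient rows G; the function name `woodbury_solve` and all sizes are chosen for illustration only. The identity replaces a d x d linear solve with an n x n one, which is cheaper when n is much smaller than d.

```python
import numpy as np


def woodbury_solve(G, b, damping):
    """Solve (damping * I_d + G.T @ G / n) x = b via the Woodbury identity.

    G has shape (n, d) with n << d, so only an n x n system is solved.
    """
    n, _ = G.shape
    # Woodbury: F^{-1} = (I - G.T (n*damping*I_n + G G.T)^{-1} G) / damping
    small = n * damping * np.eye(n) + G @ G.T
    return (b - G.T @ np.linalg.solve(small, G @ b)) / damping


# Hypothetical sizes: 8 gradient samples, 200 parameters.
rng = np.random.default_rng(0)
G = rng.normal(size=(8, 200))
b = rng.normal(size=200)
damping = 1e-2

x_fast = woodbury_solve(G, b, damping)

# Reference: form the full d x d matrix and solve it directly.
F = damping * np.eye(200) + G.T @ G / 8
x_direct = np.linalg.solve(F, b)
```

Both routes should agree up to floating-point error; the direct solve is included only as a check and would be avoided in practice.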

Weaknesses

1. Although the authors compare ROMBRL with several state-of-the-art methods such as EDAC, MOBILE, and other baselines, none of these algorithms are explicitly designed for noisy or perturbed environments. If the main claim is that ROMBRL demonstrates superior robustness under noisy conditions, it would be important to include comparisons with existing robust offline RL methods, such as RORL [1]. RORL explicitly addresses robustness to observation perturbations by employing a simple yet effectiv

Reviewer 03 · Rating 4 · Confidence 3

Strengths

The main contribution lies in the idea of adapting the world model jointly with the policy via Stackelberg learning dynamics. This formulation is elegant and potentially generalizable. The theoretical discussion provides a solid foundation for the algorithmic design, and the experimental results demonstrate consistent performance gains over standard baselines, suggesting that the approach improves robustness without severely compromising sample efficiency. The inclusion of both MuJoCo and Tokama

Weaknesses

While the Stackelberg game formulation is compelling, it would be valuable to include an ablation study comparing it directly with a simpler min-max optimization setup, where the world model acts adversarially to reduce the policy objective $J$. Such an analysis could help clarify the practical advantages of adopting the Stackelberg dynamics.

The definition of the uncertainty set via the constraint $\mathrm{KL}\big(P_{\bar\phi}(\cdot \mid s,a) \,\|\, P_{\phi}^{k}(\cdot \mid s,a)\big) \le \epsilon$ appears to rely on a manually chosen threshol
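To make the KL constraint in the point above concrete, the sketch below checks whether one next-state distribution lies within a KL ball of radius epsilon around another. It assumes both distributions are diagonal Gaussians, which the text here does not state; the function names and the numeric values are hypothetical. The first argument pair plays the role of the first distribution in the KL term and the second pair the role of the second.

```python
import numpy as np


def kl_diag_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """KL(N(mu_p, diag(sigma_p^2)) || N(mu_q, diag(sigma_q^2))), summed over dims."""
    return np.sum(
        np.log(sigma_q / sigma_p)
        + (sigma_p**2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q**2)
        - 0.5,
        axis=-1,
    )


def within_uncertainty_set(mu_p, sigma_p, mu_q, sigma_q, epsilon):
    """True where the KL divergence does not exceed the threshold epsilon."""
    return kl_diag_gaussians(mu_p, sigma_p, mu_q, sigma_q) <= epsilon


# Hypothetical predictions for one (s, a) pair with a 3-dimensional next state.
mu_ref = np.array([0.0, 1.0, -1.0])
sigma_ref = np.array([1.0, 1.0, 1.0])
mu_adapted = np.array([0.1, 1.0, -1.0])
sigma_adapted = np.array([1.0, 1.0, 1.0])

kl_value = kl_diag_gaussians(mu_ref, sigma_ref, mu_adapted, sigma_adapted)
inside = within_uncertainty_set(mu_ref, sigma_ref, mu_adapted, sigma_adapted, epsilon=0.01)
```

With equal variances the divergence reduces to half the squared mean difference, so the choice of epsilon directly bounds how far the adapted model's mean prediction may drift.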

Taxonomy

Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Adversarial Robustness in Machine Learning