Model-based Offline RL via Robust Value-Aware Model Learning with Implicitly Differentiable Adaptive Weighting
Zhongjian Qiao, Jiafei Lyu, Boxiang Lyu, Yao Shu, Siyang Gao, Shuang Qiu

TL;DR
This paper introduces ROMI, a robust model-based offline RL method that improves stability and performance by adaptively balancing model conservatism and value-awareness through implicit differentiation.
Contribution
ROMI proposes a novel robust value-aware model learning approach with adaptive weighting, addressing instability and conservatism issues in existing adversarial model learning methods.
Findings
ROMI outperforms RAMBO on D4RL and NeoRL datasets.
ROMI achieves competitive or superior results compared to state-of-the-art methods.
ROMI demonstrates stable model updates and effective out-of-distribution generalization.
Abstract
Model-based offline reinforcement learning (RL) aims to enhance offline RL with a dynamics model that facilitates policy exploration. However, model exploitation could occur due to inevitable model errors, degrading algorithm performance. Adversarial model learning offers a theoretical framework to mitigate model exploitation by solving a maximin formulation. Within such a paradigm, RAMBO (Rigter et al., 2022) has emerged as a representative and most popular method that provides a practical implementation with model gradient. However, we empirically reveal that severe Q-value underestimation and gradient explosion can occur in RAMBO with only slight hyperparameter tuning, suggesting that it tends to be overly conservative and suffers from unstable model updates. To address these issues, we propose RObust value-aware Model learning with Implicitly…
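The abstract and TL;DR refer to tuning a weighting between competing objectives through implicit differentiation. The sketch below is not ROMI's objective or implementation. It is a scalar toy problem, with hypothetical names (`solve_inner`, `hypergradient`, `tune_weight`) and made-up targets `a`, `b`, `c`, meant only to show the general idea: an inner problem is solved for a fixed weight `lam`, and the implicit function theorem gives the gradient of an outer loss with respect to `lam` without unrolling the inner solver.

```python
# Toy bi-level problem (all names and numbers are illustrative assumptions).
# Inner: theta*(lam) = argmin_theta 0.5*(theta - a)^2 + lam * 0.5*(theta - b)^2
# Outer: choose lam to minimize 0.5*(theta*(lam) - c)^2


def inner_grad(theta, lam, a, b):
    """Gradient of the inner objective with respect to theta."""
    return (theta - a) + lam * (theta - b)


def solve_inner(lam, a, b, steps=200):
    """Approximate theta*(lam) by gradient descent on the inner objective."""
    theta = 0.0
    lr = 0.5 / (1.0 + lam)  # step size scaled by the inner curvature (1 + lam)
    for _ in range(steps):
        theta -= lr * inner_grad(theta, lam, a, b)
    return theta


def outer_loss(lam, a, b, c):
    """Outer objective evaluated at the inner solution."""
    theta = solve_inner(lam, a, b)
    return 0.5 * (theta - c) ** 2


def hypergradient(lam, a, b, c):
    """d(outer_loss)/d(lam) via the implicit function theorem."""
    theta = solve_inner(lam, a, b)
    hess = 1.0 + lam        # d^2(inner) / d(theta)^2
    cross = theta - b       # d^2(inner) / d(theta) d(lam)
    dtheta_dlam = -cross / hess
    return (theta - c) * dtheta_dlam


def tune_weight(lam, a, b, c, outer_lr=1.0, steps=200):
    """Gradient descent on lam using the implicit hypergradient."""
    for _ in range(steps):
        lam = max(0.0, lam - outer_lr * hypergradient(lam, a, b, c))
    return lam


if __name__ == "__main__":
    a, b, c = 0.0, 2.0, 1.0
    print("hypergradient at lam=0.5:", hypergradient(0.5, a, b, c))
    print("tuned lam:", tune_weight(0.5, a, b, c))
```

In this toy setting the inner Hessian and the mixed derivative are available in closed form. In a neural-network setting they would typically be approximated, for example with Hessian-vector products, which is outside the scope of this sketch.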
Peer Reviews
Decision: ICLR 2026 Poster
Interesting finding in RAMBO, seemingly strong results, includes ablations, and an interesting method of balancing competing objectives that I have not come across in RL before.
**W1.** There are lots of moving parts. It seems significantly more complicated to implement and understand than RAMBO.
**W2.** The authors say they compare all methods at 1M steps for fairness. However, since some baselines like MOBILE were tuned for 3M steps, it would be important to see if ROMI's performance holds up relative to MOBILE etc. when trained for the full 3M steps.
**W3.** Larger compute requirement.
**W4.** They solve the problem of tuning lambda, but introduce more new hyperpar
- Proposes a novel, robust approach that addresses instability and over-conservatism in model-based offline RL. The bi-level optimization scheme appears to be a promising solution for adaptive conservatism, improving both stability and performance.
- Demonstrates strong empirical results on MuJoCo datasets, outperforming or matching state-of-the-art baselines.
- The proposed framework appears to be less sensitive to hyperparameters and more generalizable across tasks, as shown e.g. in Fig. 3.
- Even though robustness has clearly gone up, the approach may still require careful tuning of certain parameters, i.e. in Fig. 3 we see that while 0.01-1 are close together, setting the parameter to 10 yields very unstable results. Since all evaluations are in a relatively similar task domain (MuJoCo hopper, walker and halfcheetah), it is unclear whether the range of stable parameters is representative for other tasks as well, i.e. for other environments tuning the parameter may again become an
While the overall method is slightly complex, the paper presents the motivation and contribution of the work clearly. The proposed integration of conservative value-aware model learning into offline model-based RL is reasonable and shows some promising results. In addition, known issues about the generalization abilities of value-aware model learning are appropriately avoided with the bi-level optimization scheme.
I genuinely do not have any major issues with the paper, so the following are excessively nitpicky and mostly comments on the writing. I believe the writing of the paper could be strengthened slightly by defining the method less in contrast to RAMBO, as I believe the method conceptually stands on its own. Line 175: I don't believe this is a proper example of an adversarial loss, as there is no min/max formulation with competing networks. It would be polite to cite Farahmand for coining the t
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)
