Scaling Offline Model-Based RL via Jointly-Optimized World-Action Model Pretraining

Jie Cheng; Ruixi Qiao; Yingwei Ma; Binhua Li; Gang Xiong; Qinghai Miao; Yongbin Li; Yisheng Lv

arXiv:2410.00564·cs.LG·January 30, 2026

TL;DR

JOWA introduces a jointly-optimized world-action model pretrained on diverse Atari data, achieving high performance and strong generalization in offline RL through a shared transformer backbone and efficient planning.

Contribution

The paper presents JOWA, a novel offline model-based RL approach that jointly trains a world-action model, enabling scalable, generalizable decision-making from large heterogeneous datasets.

Findings

01

Achieves 78.9% human-level performance on the pretrained Atari games using only 10% of the offline data.

02

Outperforms state-of-the-art offline RL baselines by 31.6% on average.

03

Demonstrates effective transfer to new games with minimal fine-tuning data.
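
Percent-of-human figures on Atari are commonly reported as a human-normalized score, and the sketch below assumes that convention applies to the 78.9% figure above. The game names and reference scores are made-up placeholders, not values from the paper, and the aggregation shown (a plain mean) is also an assumption; the paper may use a median or interquartile mean instead.

```python
def human_normalized_score(agent: float, random: float, human: float) -> float:
    """Return (agent - random) / (human - random); 1.0 means human-level."""
    if human == random:
        raise ValueError("human and random reference scores must differ")
    return (agent - random) / (human - random)


# Hypothetical per-game scores as (agent, random, human). Placeholders only.
scores = {
    "GameA": (500.0, 100.0, 600.0),
    "GameB": (30.0, 10.0, 50.0),
}

# Normalize each game separately, then aggregate with a simple mean (assumed).
per_game = {name: human_normalized_score(*s) for name, s in scores.items()}
mean_hns = sum(per_game.values()) / len(per_game)
print(f"mean human-normalized score: {mean_hns:.1%}")
```

Normalizing per game before averaging keeps high-scoring games from dominating the aggregate, which is why raw scores are rarely averaged directly.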

Abstract

A significant aspiration of offline reinforcement learning (RL) is to develop a generalist agent with high capabilities from large and heterogeneous datasets. However, prior approaches that scale offline RL either rely heavily on expert trajectories or struggle to generalize to diverse unseen tasks. Inspired by the excellent generalization of world models in conditional video generation, we explore the potential of image observation-based world models for scaling offline RL and enhancing generalization on novel tasks. In this paper, we introduce JOWA: Jointly-Optimized World-Action model, an offline model-based RL agent pretrained on multiple Atari games with 6 billion tokens of data to learn general-purpose representation and decision-making ability. Our method jointly optimizes a world-action model through a shared transformer backbone, which stabilizes temporal difference learning with…
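
To make the idea of joint optimization concrete, here is a minimal sketch of a combined objective: a world-modeling term (cross-entropy on the next observation token) plus a temporal-difference term on Q-values. It is an illustration under stated assumptions, not JOWA's actual loss. The function names, the weight `beta`, the discount `gamma`, and the toy numbers are all hypothetical, and any additional terms or regularizers the paper uses are not covered in this excerpt. In a real model both heads would read the same transformer hidden state, so gradients from both terms would update the shared backbone; this pure-Python version only computes the scalar losses.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def td_error_sq(q_values, action, reward, next_q_target, gamma=0.99, done=False):
    """Squared one-step TD error for the action that was taken."""
    bootstrap = 0.0 if done else gamma * max(next_q_target)
    return (q_values[action] - (reward + bootstrap)) ** 2


def joint_loss(next_token_logits, next_token, q_values, action,
               reward, next_q_target, beta=1.0):
    """World-model loss plus beta-weighted TD loss (beta is an assumed knob)."""
    world = cross_entropy(next_token_logits, next_token)
    td = td_error_sq(q_values, action, reward, next_q_target)
    return world + beta * td, world, td


# Toy outputs standing in for the two heads of a shared backbone.
total, world, td = joint_loss(
    next_token_logits=[2.0, 0.0, 0.0], next_token=0,
    q_values=[1.0, 0.5], action=0,
    reward=1.0, next_q_target=[0.0, 1.0],
)
print(f"world={world:.4f} td={td:.4f} total={total:.4f}")
```

Summing the two terms into one scalar is the simplest way to let a single optimizer step serve both objectives; how the terms are weighted in practice is a design choice this sketch does not settle.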

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 5 · Confidence 4

Strengths

1. Scalability: JOWA’s design showcases robust scaling potential, as performance improves with model size without the usual TD-learning instability issues.
2. Detailed Ablation Studies: The authors conducted extensive ablations, examining the impact of core design elements such as task embeddings, training losses, and synthetic data usage.

Weaknesses

1. Missing related work and explanation of the world-action model architecture. The work uses VQ-VAE for representation learning in Atari games, yet an existing method, Forward-Inverse Cycle Consistency (FICC), also uses VQ-VAE on the offline Atari dataset to learn representations and action embeddings. The pipeline appears very similar to FICC, but the authors do not explain how JOWA differs from it. (Both use offline Atari datasets for mo

Reviewer 02 · Rating 8 · Confidence 4

Strengths

- The proposed JOWA outperforms existing SOTA methods by a large margin. The performance scales up with model size.
- The joint optimization of world and action models stabilizes large-scale multi-task offline RL training.
- The ablation studies comprehensively examine the key design choices of the proposed method.

Weaknesses

- The proposed method combines the best offline RL training techniques, leveraging the world modeling loss to stabilize Q-value learning. The empirical performance is impressive, but the technical novelty is therefore limited.
- A closer look at Table 2 shows that the 150M variant does not consistently outperform the 40M and 70M variants on all tasks. For example, the 40M variant achieves the highest score on Centipede, while the 70M variant excels on NameThisGame, SpaceInvaders

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. This paper is clearly written and easy to follow.
2. This paper presents sufficient experimental results to demonstrate the validity of its proposed method.

Weaknesses

1. A large generalist TD-MPC2 agent is capable of performing a variety of tasks across multiple domains. I wonder whether the proposed method outperforms TD-MPC2 in the offline setup.
2. Extending the experiments beyond Atari to more complex environments such as Kitchen or Meta-World would offer stronger validation of the proposed method's effectiveness.

Code & Models

Repositories

cjreinforce/jowa · PyTorch · Official

Taxonomy

Topics: Real-time simulation and control systems · Hydraulic and Pneumatic Systems · Software Testing and Debugging Techniques