Learning Massively Multitask World Models for Continuous Control
Nicklas Hansen, Hao Su, Xiaolong Wang

TL;DR
This paper introduces Newt, a multitask world model trained on hundreds of tasks with online interaction, demonstrating improved performance, data efficiency, and adaptability in continuous control tasks.
Contribution
It presents a novel approach combining large-scale pretraining and online learning for multitask control, along with a new benchmark of 200 diverse tasks.
Findings
Newt outperforms strong baselines in multitask settings.
It exhibits strong open-loop control capabilities.
It enables rapid adaptation to unseen tasks.
Abstract
General-purpose control demands agents that act across many tasks and embodiments, yet research on reinforcement learning (RL) for continuous control remains dominated by single-task or offline regimes, reinforcing a view that online RL does not scale. Inspired by the foundation model recipe (large-scale pretraining followed by light RL), we ask whether a single agent can be trained on hundreds of tasks with online interaction. To accelerate research in this direction, we introduce a new benchmark with 200 diverse tasks spanning many domains and embodiments, each with language instructions, demonstrations, and optionally image observations. We then present Newt, a language-conditioned multitask world model that is first pretrained on demonstrations to acquire task-aware representations and action priors, and then jointly optimized with online interaction across all tasks.…
Peer Reviews
Decision·ICLR 2026 Poster
1. This paper proposes a benchmark that integrates domains popularly studied in the RL community and releases single-task checkpoints and a dataset, which is significant not only for multi-task RL but also for offline, O2O, and continuous RL research. 2. The empirical results show advantages over baselines on ManiSkill and DMControl. 3. Model details such as training time and model architecture are presented thoroughly. 4. The figures are well drawn and easy to understand.
1. There is no preliminary description, so the problem setting confused me at the beginning. In the multi-task RL setting, a task label $n$ should be added to the original $(s, a, s', r)$, but in Line 275 there is only $(s, a, r)$. 2. Newt only shows a performance boost over baselines in DMC and ManiSkill out of all 10 domains. In Meta-World, MuJoCo, Box2D, Robodesk, and Atari it is just on par with FastTD3, while in OGBench and MiniArcade it is on par with behavior cloning. 3. Selected baselines ar
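The task-labelled transition that the reviewer asks for in point 1 can be illustrated with a small sketch. This is not taken from the paper or its code release: the names `TaskTransition` and `MultitaskReplayBuffer` are hypothetical, and uniform sampling over all stored transitions is an assumption made only to keep the example short. It shows one way a shared replay buffer could carry the task label $n$ alongside $(s, a, r, s')$.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TaskTransition:
    """One transition (n, s, a, r, s') with an explicit task label (hypothetical type)."""
    task_id: int                   # the task label n
    obs: Tuple[float, ...]         # state s
    action: Tuple[float, ...]      # action a
    reward: float                  # reward r
    next_obs: Tuple[float, ...]    # next state s'


class MultitaskReplayBuffer:
    """Stores transitions from every task in a single shared list (hypothetical class)."""

    def __init__(self) -> None:
        self._storage: List[TaskTransition] = []

    def add(self, transition: TaskTransition) -> None:
        self._storage.append(transition)

    def __len__(self) -> int:
        return len(self._storage)

    def sample(self, batch_size: int, rng: random.Random) -> List[TaskTransition]:
        # Assumption: uniform sampling with replacement across all tasks.
        return [rng.choice(self._storage) for _ in range(batch_size)]


# Fill the buffer with 4 dummy transitions for each of 3 tasks.
buffer = MultitaskReplayBuffer()
for task_id in range(3):
    for step in range(4):
        buffer.add(
            TaskTransition(
                task_id=task_id,
                obs=(float(step),),
                action=(0.0,),
                reward=1.0,
                next_obs=(float(step + 1),),
            )
        )

# Every sampled transition still knows which task produced it.
batch = buffer.sample(5, random.Random(0))
```

Because each sampled element keeps its `task_id`, a task-conditioned model can look up the matching task embedding or language instruction for every item in a mixed batch.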
The paper has several notable merits: - MMBench provides a unified framework across 10 heterogeneous domains with consistent data handling and language-conditioned tasks. - The paper introduces reasonable design choices such as discrete regression for reward/value prediction and per-task discount factors, with comprehensive ablations supporting their impact. - The paper is well-organized, figures effectively illustrate key results, and open-sourced resources (200+ checkpoints, 4000+ demos) signi
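The "discrete regression" mentioned in this review can be illustrated with two-hot encoding, one common way to turn scalar regression into classification over fixed bins. The sketch below is an assumption-laden illustration, not the authors' implementation: the bin range, the bin count, and the helper names `make_bins`, `two_hot`, `decode`, and `soft_cross_entropy` are all hypothetical, and any value transformation applied before binning is omitted.

```python
import math
from typing import List


def make_bins(vmin: float, vmax: float, num_bins: int) -> List[float]:
    """Uniformly spaced bin centers over [vmin, vmax] (range is an assumption)."""
    step = (vmax - vmin) / (num_bins - 1)
    return [vmin + i * step for i in range(num_bins)]


def two_hot(x: float, bins: List[float]) -> List[float]:
    """Encode scalar x as weights on its two neighbouring bins; weights sum to 1."""
    vmin, vmax = bins[0], bins[-1]
    x = min(max(x, vmin), vmax)                 # clip to the supported range
    step = (vmax - vmin) / (len(bins) - 1)
    pos = (x - vmin) / step                     # fractional bin index
    lower = min(int(math.floor(pos)), len(bins) - 2)
    upper_weight = pos - lower
    target = [0.0] * len(bins)
    target[lower] = 1.0 - upper_weight
    target[lower + 1] = upper_weight
    return target


def decode(probs: List[float], bins: List[float]) -> float:
    """Recover a scalar as the expectation of the bin centers under probs."""
    return sum(p * b for p, b in zip(probs, bins))


def soft_cross_entropy(logits: List[float], target: List[float]) -> float:
    """Cross-entropy between a soft target and softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -sum(t * (l - log_z) for t, l in zip(target, logits))


bins = make_bins(-2.0, 2.0, 5)      # [-2.0, -1.0, 0.0, 1.0, 2.0]
target = two_hot(0.25, bins)        # weight 0.75 on bin 0.0, 0.25 on bin 1.0
recovered = decode(target, bins)    # expectation gives back 0.25
loss = soft_cross_entropy([0.0] * 5, target)  # loss under uniform logits
```

Training a reward or value head against `target` with a cross-entropy loss, and reading predictions back through `decode`, keeps the loss scale comparable across tasks whose reward magnitudes differ widely, which is the usual motivation for this design.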
### 1. Novelty concerns in core contributions
While MMBench contains 200 tasks, most of them are directly inherited from existing benchmarks (e.g., DMControl, Meta-World). This limits its novelty compared to benchmarks such as ManiSkill3, which introduces fundamentally new task paradigms. Similarly, Newt, although incorporating CLIP/DINOv2 encoders and demonstration conditioning, builds incrementally upon TD-MPC2, without a clear paradigm shift.
### 2. Missing analysis of task scalability
The p
The paper presents good results on continuous control benchmarks, which remains an interesting problem. Training one policy over such a wide variety of tasks is especially interesting to see. The algorithm is presented clearly and is easy to follow, and its model-based MPC aspect is interesting. I really appreciate the authors' effort in making the code and checkpoints accessible; this makes it possible to reproduce the results and build on top of them.
My major concern is the applicability of this to real-world continuous control problems. While the results look good in simulation, it requires over 100M steps to train this policy, which would be infeasible in a real-world application. I also think the paper would benefit from ablating the usefulness of the different components. Specifically, it would be interesting to understand how useful the MPC planning is, how much learning the world model helps performance, as well as how much does the
Taxonomy
Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Robot Manipulation and Learning
