PWM: Policy Learning with Multi-Task World Models

Ignat Georgiev; Varun Giridhar; Nicklas Hansen; Animesh Garg

arXiv:2407.02466·cs.LG·February 25, 2025

Open Access · 3 Reviews

TL;DR

PWM introduces a model-based reinforcement learning algorithm that leverages regularized world models to enable fast, efficient policy learning across multiple tasks with high-dimensional action spaces, outperforming existing methods.

Contribution

The paper presents an approach that combines offline world-model pre-training with first-order policy optimization on regularized world models for multi-task continuous control.

Findings

1. PWM solves tasks with up to 152 action dimensions.

2. PWM outperforms methods that use ground-truth dynamics.

3. PWM scales to an 80-task setting, with up to 27% higher rewards than existing baselines.

Abstract

Reinforcement Learning (RL) has made significant strides in complex tasks but struggles in multi-task settings with different embodiments. World model methods offer scalability by learning a simulation of the environment but often rely on inefficient gradient-free optimization methods for policy extraction. In contrast, gradient-based methods exhibit lower variance but fail to handle discontinuities. Our work reveals that well-regularized world models can generate smoother optimization landscapes than the actual dynamics, facilitating more effective first-order optimization. We introduce Policy learning with multi-task World Models (PWM), a novel model-based RL algorithm for continuous control. Initially, the world model is pre-trained on offline data, and then policies are extracted from it using first-order optimization in less than 10 minutes per task. PWM effectively solves tasks…
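
The abstract's central claim, that a smooth learned model can be easier to optimize with first-order gradients than the true, discontinuous dynamics, can be illustrated with a toy sketch. Everything below is hypothetical and not taken from the paper: `true_return` is a made-up step-function objective standing in for contact-like discontinuities, and `smooth_return` is a sigmoid surrogate standing in for a regularized world model.

```python
import math


def true_return(theta: float) -> float:
    """Hypothetical discontinuous objective: reward appears only past a threshold."""
    return 1.0 if theta > 1.0 else 0.0


def smooth_return(theta: float, sharpness: float = 2.0) -> float:
    """Hypothetical smooth surrogate (a sigmoid) for the step above."""
    return 1.0 / (1.0 + math.exp(-sharpness * (theta - 1.0)))


def smooth_grad(theta: float, sharpness: float = 2.0) -> float:
    """Analytic derivative of smooth_return with respect to theta."""
    s = smooth_return(theta, sharpness)
    return sharpness * s * (1.0 - s)


def finite_diff(f, theta: float, eps: float = 1e-3) -> float:
    """Central finite-difference estimate of df/dtheta."""
    return (f(theta + eps) - f(theta - eps)) / (2.0 * eps)


def gradient_ascent(theta0: float, lr: float = 0.5, steps: int = 200) -> float:
    """First-order ascent that only ever queries the smooth surrogate."""
    theta = theta0
    for _ in range(steps):
        theta += lr * smooth_grad(theta)
    return theta


theta0 = -1.0
# The true objective is flat around the start, so it offers no gradient signal.
true_slope_at_start = finite_diff(true_return, theta0)
# Ascent on the surrogate still moves theta toward (and past) the threshold.
theta_star = gradient_ascent(theta0)
```

In this toy, the finite-difference slope of `true_return` at the starting point is zero, so a first-order method applied to the true objective has nothing to follow, while ascent on the surrogate carries `theta` past the threshold. The sketch says nothing about how PWM's world model is actually trained or regularized.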

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. This paper is clearly written and easy to follow.
2. This paper presents sufficient experimental results to demonstrate the validity of its proposed method.

Weaknesses

1. My primary concern revolves around the novelty of the proposed approach. This study appears to amalgamate TD-MPC2 and SHAC methodologies. To elaborate, a multi-task world model incorporating a task embedding and SimNorm activation mirrors aspects of TD-MPC2 for representation learning. Subsequently, during the policy learning phase, policies are acquired via first-order optimization akin to SHAC. Consequently, the original contribution of this work may seem somewhat limited.
2. In Figure 13,
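
For readers unfamiliar with the SimNorm activation mentioned in this review: as described in TD-MPC2, it splits a latent vector into fixed-size groups and applies a softmax within each group, so every group lies on a simplex. The sketch below is a plain-Python illustration under that reading. The function name `simnorm`, the default group size, and the `temperature` argument are illustrative assumptions, not PWM's or TD-MPC2's actual code.

```python
import math
from typing import List


def simnorm(z: List[float], group_size: int = 8, temperature: float = 1.0) -> List[float]:
    """Apply a softmax independently to each consecutive group of `group_size` entries."""
    if group_size <= 0 or len(z) % group_size != 0:
        raise ValueError("len(z) must be a positive multiple of group_size")
    out: List[float] = []
    for start in range(0, len(z), group_size):
        group = z[start:start + group_size]
        peak = max(group)  # subtract the max for numerical stability
        exps = [math.exp((v - peak) / temperature) for v in group]
        total = sum(exps)
        out.extend(e / total for e in exps)
    return out


# Hypothetical 8-dimensional latent split into two groups of four.
latent = [0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 4.0]
normalized = simnorm(latent, group_size=4)
```

Each group of `normalized` is non-negative and sums to one, which bounds the latent state regardless of the scale of the encoder's raw output.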

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- Good baseline comparisons with impressive results for PWM. The comparison to a method that uses ground truth dynamics gradients (from the differentiable simulator) is valuable and compelling.
- Good ablation experiments investigating sensitivity to the environment dynamics and the degree of world model regularization.

Weaknesses

- A diagram of the model and policy training would be helpful (i.e. more detailed than Figure 1), perhaps specifically illustrating what is different about this vs. TD-MPC2.
- I find Figure 2 confusing: What is the y-axis of the middle plot? What does a negative value mean? Is theta the angle (perhaps confusing because it is overloading the use of the theta symbol)? Additionally, I’m a little confused about how this example relates to a broader phenomenon of contact-induced discontinuities: th

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The key idea of the paper is well delivered using pedagogical examples and is easy to follow.
2. The results are highly reproducible with detailed experimental settings and code included, which also benefits the community for further study.
3. The experiments conducted in this paper are quite comprehensive and solid.

Weaknesses

1. Typos and incorrect format
   - Incorrect citation format in chapter 4.2: when the authors or the publication are included in the sentence, the citation should not be in parentheses; please use \citet{} instead.
   - Typo in Equation (2): it should be $\gamma^{h-t+1}$ instead of $\gamma^h$.
   - Typo in Algorithm 1: "$\nabla$" is omitted in all equations for learnable parameter updates.
   - Typo in Figure 10: missing ")" in the caption.
2. Lack of novelty
   - The key idea of policy optimization via

Taxonomy

Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Robot Manipulation and Learning