Contextual Latent World Models for Offline Meta Reinforcement Learning

Mohammadreza Nakheai; Aidan Scannell; Kevin Luck; Joni Pajarinen

arXiv:2603.02935·cs.LG·March 4, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces contextual latent world models for offline meta-reinforcement learning, which jointly learn task representations and dynamics to improve generalization across unseen tasks.

Contribution

It proposes a novel method that conditions latent world models on inferred task representations, enhancing task-dependent dynamics capture and generalization.

Findings

1. Improved task representation quality.
2. Enhanced generalization to unseen tasks.
3. Significant performance gains on multiple benchmarks.

Abstract

Offline meta-reinforcement learning seeks to learn policies that generalize across related tasks from fixed datasets. Context-based methods infer a task representation from transition histories, but learning effective task representations without supervision remains a challenge. In parallel, latent world models have demonstrated strong self-supervised representation learning through temporal consistency. We introduce contextual latent world models, which condition latent world models on inferred task representations and train them jointly with the context encoder. This enforces task-conditioned temporal consistency, yielding task representations that capture task-dependent dynamics rather than merely discriminating between tasks. Our method learns more expressive task representations and significantly improves generalization to unseen tasks across MuJoCo, Contextual-DeepMind Control,…
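The core idea in the abstract, a latent dynamics model conditioned on an inferred task representation and scored by temporal consistency, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the linear maps, the mean-pooling context encoder, the dimensions, and the function names `infer_task_representation` and `temporal_consistency_loss` are all hypothetical. The sketch also uses a plain squared-error consistency loss for brevity, whereas the reviews describe the paper's loss as classification-based.

```python
import random

# Hypothetical dimensions, chosen only for this illustration.
S_DIM, A_DIM, Z_DIM, H_DIM = 3, 2, 2, 4

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    """Small random weights standing in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def random_transition(rng):
    """A synthetic (s, a, r, s') tuple; real data would come from an offline dataset."""
    s = [rng.gauss(0, 1) for _ in range(S_DIM)]
    a = [rng.gauss(0, 1) for _ in range(A_DIM)]
    s_next = [rng.gauss(0, 1) for _ in range(S_DIM)]
    return s, a, rng.gauss(0, 1), s_next

def infer_task_representation(W_ctx, context):
    """Assumed context encoder: embed each transition linearly, then mean-pool into z."""
    embeds = [matvec(W_ctx, s + a + [r] + s_next) for s, a, r, s_next in context]
    return [sum(col) / len(embeds) for col in zip(*embeds)]

def temporal_consistency_loss(W_enc, W_dyn, z, batch):
    """Squared error between the predicted next latent and the encoded next state.

    The dynamics input concatenates the current latent, the action, and the
    task representation z, so the prediction is task-conditioned.
    """
    total = 0.0
    for s, a, _r, s_next in batch:
        h = matvec(W_enc, s)                 # latent of the current state
        h_pred = matvec(W_dyn, h + a + z)    # task-conditioned latent prediction
        h_target = matvec(W_enc, s_next)     # target latent, treated as fixed
        total += sum((p - t) ** 2 for p, t in zip(h_pred, h_target))
    return total / len(batch)

rng = random.Random(0)
W_ctx = rand_matrix(Z_DIM, 2 * S_DIM + A_DIM + 1, rng)
W_enc = rand_matrix(H_DIM, S_DIM, rng)
W_dyn = rand_matrix(H_DIM, H_DIM + A_DIM + Z_DIM, rng)

context = [random_transition(rng) for _ in range(8)]   # transition history for one task
batch = [random_transition(rng) for _ in range(16)]    # transitions to score

z = infer_task_representation(W_ctx, context)
loss = temporal_consistency_loss(W_enc, W_dyn, z, batch)
print("task representation:", z)
print("temporal consistency loss:", loss)
```

The sketch only runs the forward computation. In a jointly trained setup of the kind the abstract describes, the gradient of this loss would presumably update the context encoder as well as the dynamics model, which is what would push `z` toward capturing task-dependent dynamics.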

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper is clearly written and easy to follow. 2. The idea of extending DCWM to a conditioned version for generalizing across different tasks, and jointly training the model to obtain better task representations, is novel and well-motivated. 3. The authors have conducted comprehensive experiments and ablation studies to verify and analyze the effectiveness of C-DCWM.

Weaknesses

1. While the current experimental validation is primarily conducted on continuous control tasks, these environments typically feature consistent dynamics within each task, which likely simplifies the learning of a discrete codebook. To further substantiate the generalizability of the proposed method, it would be compelling to evaluate it on more heterogeneous domains, such as the Atari benchmark. In these environments, a single task (or game) often comprises distinct levels requiring diverse pol

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The integration of context-conditioned latent world models is clearly implemented and experimentally well-controlled. 2. Empirical performance across multiple offline meta-RL benchmarks is consistent, showing that classification-based temporal consistency can outperform standard regression losses.

Weaknesses

1. Limited novelty. The approach mainly combines existing ingredients, e.g., context encoders from meta-RL, discrete latent world models from Dreamer/TD-MPC, and contrastive representation learning, into a joint training scheme. While the integration is clean, the conceptual advance over prior work such as CSRO, UNICORN, and discrete-latent Dreamer variants is marginal. 2. Overreliance on IQL head. The offline RL component is fixed to IQL; no evidence is provided that the representation benefi

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The study of world models capable of generalizing across multiple tasks is of significant practical importance, as it promises to substantially advance the real-world application of reinforcement learning. 2. The authors provide extensive experiments across multiple benchmarks (MuJoCo, Contextual-DMC, and Meta-World), demonstrating the method's strong empirical performance against several baselines.

Weaknesses

1. The proposed method appears to be a relatively straightforward extension of the DCWM framework to the multi-task setting by conditioning it on a task representation. This potentially limits the technical novelty of this work. 2. The evaluation of generalization is limited. The out-of-distribution experiments primarily focus on parametric variations within the same task families. A more challenging and practically relevant OOD evaluation, such as generalization to unseen tasks in Meta-World, is


Taxonomy

Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning