Zero-shot Model-based Reinforcement Learning using Large Language Models
Abdelhakim Benechehab, Youssef Attia El Hili, Ambroise Odonnat, Oussama Zekri, Albert Thomas, Giuseppe Paolo, Maurizio Filippone, Ievgen Redko, Balázs Kégl

TL;DR
This paper explores leveraging large language models for zero-shot model-based reinforcement learning in continuous environments, introducing a novel method called Disentangled In-Context Learning to improve dynamics prediction and uncertainty estimation.
Contribution
It proposes Disentangled In-Context Learning (DICL) to enable LLMs to predict dynamics in continuous MDPs, addressing multivariate data and control signal challenges, with theoretical and experimental validation.
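The summary above does not spell out how DICL works, so the following is only a rough sketch under stated assumptions. It assumes "disentangled" means projecting the joint state-action vector onto linearly decorrelated components (PCA is used here), forecasting each component independently, and mapping the forecast back to the original space. The function names (`fit_pca`, `persistence_forecast`, `dicl_like_next_step`) are hypothetical, and the per-component forecaster is a trivial last-value placeholder standing in for the in-context LLM; none of this is taken from the authors' code.

```python
import numpy as np

def fit_pca(X, n_components):
    """Fit PCA via SVD; return the feature mean and the top principal axes."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def persistence_forecast(series):
    """Placeholder for the per-component in-context forecaster (assumption).

    A real system would serialize `series` into a prompt and query an LLM;
    here we simply repeat the last observed value.
    """
    return series[-1]

def dicl_like_next_step(context, n_components, forecaster=persistence_forecast):
    """Predict the next row of a (T, d) state-action trajectory.

    1. Project the trajectory onto n_components principal axes.
    2. Forecast each component independently as a univariate series.
    3. Map the forecast back to the original state-action space.
    """
    mean, axes = fit_pca(context, n_components)
    z = (context - mean) @ axes.T  # (T, n_components) component series
    z_next = np.array([forecaster(z[:, j]) for j in range(z.shape[1])])
    return mean + z_next @ axes

# Toy trajectory: 3 state dimensions (random walk) plus 1 action dimension.
rng = np.random.default_rng(0)
states = rng.normal(size=(50, 3)).cumsum(axis=0)
actions = rng.normal(size=(50, 1))
context = np.hstack([states, actions])

pred = dicl_like_next_step(context, n_components=4)
pred_reduced = dicl_like_next_step(context, n_components=2)
```

Keeping all four components makes the projection lossless, so with the placeholder forecaster the prediction simply reproduces the last observed row; using fewer components trades reconstruction accuracy for fewer univariate series to forecast.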
Findings
DICL improves dynamics prediction accuracy.
The approach yields well-calibrated uncertainty estimates.
The method is demonstrated in proof-of-concept applications to model-based policy evaluation and data-augmented off-policy RL.
Abstract
The emerging zero-shot capabilities of Large Language Models (LLMs) have led to their applications in areas extending well beyond natural language processing tasks. In reinforcement learning, while LLMs have been extensively used in text-based environments, their integration with continuous state spaces remains understudied. In this paper, we investigate how pre-trained LLMs can be leveraged to predict in context the dynamics of continuous Markov decision processes. We identify handling multivariate data and incorporating the control signal as key challenges that limit the potential of LLMs' deployment in this setup and propose Disentangled In-Context Learning (DICL) to address them. We present proof-of-concept applications in two reinforcement learning settings: model-based policy evaluation and data-augmented off-policy reinforcement learning, supported by theoretical analysis of the…
Peer Reviews
Decision: ICLR 2025 Poster
This paper has good originality in that it innovatively proposes a novel DICL method to generalize ICL to continuous-state-space RL. This paper also includes solid mathematical derivations and proofs and detailed experiment results to support the claims in the paper.
There is considerable room for improvement in the clarity, writing, and presentation of this paper. Several places in the paper are not clearly explained, and the general theme of the paper is somewhat hard to follow. For example, nowhere in the paper is there an explicit explanation or demonstration of how exactly the LLM prompts for DICL are constructed. One or more concrete prompt examples would be very helpful in the paper to help readers understand the core techni
1. The paper presents a method to integrate state dimension interdependence and action information into in-context trajectories within RL environments, enhancing the applicability of LLMs in continuous state spaces. 2. It provides a theoretical analysis of the policy evaluation algorithm resulting from multi-branch rollouts with LLM-based dynamics models, leading to a novel return bound that enhances understanding in this area. 3. The paper offers empirical evidence supporting the benefits of LL
1. The paper does not extensively discuss how the proposed method generalizes across different environments or tasks, that is, more discussion about the application of this method is needed. 2. While DICL simplifies certain aspects of RL, the integration of actions and the handling of multivariate data present ongoing challenges. More discussion about the introduced aspects of the DICL is needed. 3. The experiments are somewhat simplistic, and it would be worthwhile to conduct more in-depth anal
(please also see the summary) 1. This work systematically studies LLM-based MBRL. 2. The method is clearly presented and easy to follow.
The “zero-shot” claim is potentially misleading. If “zero-shot” is defined at the trajectory level, it is true that no trajectory-level examples were shown to the LLM during prediction. However, as shown in Section 4 (theoretical analysis), it appears necessary to use true dynamics to predict the transition and reward for steps $t < T$. These transitions, such as from $\langle s_{t-1}, a_{t-1} \rangle$ to $s_t$, effectively serve as state-level few-shot examples. In my understanding, a true “zero-shot” settin
Taxonomy
Topics: Reinforcement Learning in Robotics
