Scalable In-Context Q-Learning
Jinmei Liu, Fuhong Liu, Zhenhong Sun, Jianye Hao, Huaxiong Li, Bo Wang, Daoyi Dong, Chunlin Chen, Zhi Wang

TL;DR
This paper introduces S-ICQL, a scalable in-context Q-learning framework that combines dynamic programming and world modeling with a prompt-based transformer architecture to improve decision-making in complex environments.
Contribution
It proposes a novel, scalable in-context reinforcement learning method that integrates world modeling and dynamic programming with a transformer architecture for efficient policy learning.
Findings
Consistent performance improvements over baselines.
Effective learning from suboptimal trajectories.
Fast and precise in-context inference enabled by world models.
Abstract
Recent advancements in language models have demonstrated remarkable in-context learning abilities, prompting the exploration of in-context reinforcement learning (ICRL) to extend that promise to decision domains. Because decision-making involves more complex dynamics and temporal correlations, existing ICRL approaches may struggle to learn from suboptimal trajectories and to achieve precise in-context inference. In this paper, we propose Scalable In-Context Q-Learning (S-ICQL), a framework that harnesses dynamic programming and world modeling to steer ICRL toward efficient reward maximization and task generalization, while retaining the scalability and stability of supervised pretraining. We design a prompt-based multi-head transformer architecture that simultaneously predicts optimal policies and in-context value functions using…
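As a rough illustration of why dynamic programming helps when learning from suboptimal trajectories, the sketch below runs generic tabular Q-learning backups over a fixed dataset. It is not the S-ICQL objective, which the truncated abstract does not spell out; the toy chain environment, the function name `q_backups_from_dataset`, and all hyperparameters are assumptions made for illustration only.

```python
from collections import defaultdict


def q_backups_from_dataset(transitions, n_actions, gamma=0.9, sweeps=50):
    """Repeated Bellman backups over a fixed (offline) set of transitions.

    Hypothetical helper: a generic tabular sketch, not the paper's method.
    """
    q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(sweeps):
        for state, action, reward, next_state, done in transitions:
            # Bootstrapped target: reward plus discounted best next value.
            target = reward if done else reward + gamma * max(q[next_state])
            # The toy environment is deterministic, so a full overwrite suffices.
            q[state][action] = target
    return q


# Toy chain 0 -> 1 -> 2 (goal). Action 1 moves right, action 0 stays put.
# The dataset mixes useful and wasteful steps, as suboptimal trajectories would.
dataset = [
    (0, 0, 0.0, 0, False),
    (0, 1, 0.0, 1, False),
    (1, 0, 0.0, 1, False),
    (1, 1, 1.0, 2, True),
]

q_table = q_backups_from_dataset(dataset, n_actions=2)
greedy_policy = {s: max(range(2), key=lambda a: q_table[s][a]) for s in (0, 1)}
print(greedy_policy)
```

Even though half of the logged steps make no progress, the backed-up values rank the "move right" action highest in both states, which is the stitching behaviour that pure imitation of the logged actions would not recover.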
Peer Reviews
Decision·ICLR 2026 Poster
- The manuscript is well-written and easy to follow.
- The experiments are comprehensive, including an array of strong baselines.
- The performance of S-ICQL is strong.
My biggest concern is about the novelty of the proposed framework S-ICQL. While it has many components, all of its components and insights have appeared in prior works. As a consequence, S-ICQL feels like a unification of ideas from prior works. Specifically, there are two core technical contributions:
- **Section 3.3.** The world modeling is **identical** to [1]. Both the model design and pretraining process are almost identical to the task-embedding model of [1].
- **Section 3.4.** S-ICQL al
- Good experimental results that consider a number of relevant baselines.
- Ablations show that each of the components contributes meaningfully to performance.
- The stitching section shows improvements over the best returns in the dataset.
- My main issue is that it is very hard to understand the details of the method. Figure 1 is very complicated; there are many arrows going in different directions and it is not obvious what exactly “precise task representation” and “precise lightweight prompt” are.
- The method is also quite complicated and not particularly novel. It seems to be a combination of a number of existing components (e.g., AWR, world model, transformer-based policies).
1. The proposed world model effectively captures task-relevant information, providing a more compact and informative representation compared to previous ICRL approaches.
2. The integration of Q-function learning and advantage-weighted regression (AWR) allows the model to optimize its policy from suboptimal data and obtain better-performing behaviors.
3. The experimental results are relatively comprehensive, with comparisons to multiple baselines demonstrating consistent improvement
1. The experimental environments are relatively simple. It would strengthen the work to include results on more challenging or diverse environments.
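For readers unfamiliar with the advantage-weighted regression (AWR) that the reviews refer to, here is a minimal sketch of the standard AWR weighting, in which each logged action's log-likelihood is scaled by the exponentiated advantage. The temperature `beta`, the clipping value `max_weight`, and both function names are assumptions for illustration; the reviews do not state how S-ICQL configures or implements this step.

```python
import math


def awr_weights(q_values, state_values, beta=1.0, max_weight=20.0):
    """Exponentiated-advantage weights, clipped for numerical stability.

    Hypothetical helper: beta and max_weight are illustrative choices.
    """
    weights = []
    for q, v in zip(q_values, state_values):
        advantage = q - v  # how much better the logged action is than average
        weights.append(min(math.exp(advantage / beta), max_weight))
    return weights


def weighted_policy_loss(log_probs, weights):
    """Weighted negative log-likelihood of the logged actions."""
    total = sum(-w * lp for w, lp in zip(weights, log_probs))
    return total / len(log_probs)


# Three logged actions: better than average, average, and far better than average.
weights = awr_weights(q_values=[1.0, 0.0, 100.0], state_values=[0.0, 0.0, 0.0])
loss = weighted_policy_loss(log_probs=[-0.5, -0.5, -0.5], weights=weights)
print(weights, loss)
```

Actions with positive advantage receive weights above 1 and are imitated more strongly, while the clip keeps a single very large advantage from dominating the loss.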
Taxonomy
Topics: Gaze Tracking and Assistive Technology
