ICPRL: Acquiring Physical Intuition from Interactive Control
Xinrun Xu, Pi Bu, Ye Wang, Börje F. Karlsson, Ziming Wang, Tengtao Song, Qi Zhu, Jun Song, Shuo Zhang, Zhiming Ding, Bo Zheng

TL;DR
ICPRL enables vision-language models to acquire physical intuition and adapt policies through interactive, in-context learning without weight updates, improving performance on physics-based tasks in unseen environments.
Contribution
We propose ICPRL, a novel framework combining in-context reinforcement learning with visual models and explicit physical reasoning, allowing adaptive policy learning from interaction histories.
Findings
Significant performance improvements on DeepPHY physics tasks.
Effective adaptation in unseen physical environments.
Policy and world model collaboration enhances decision-making.
Abstract
VLMs excel at static perception but falter in interactive reasoning in dynamic physical environments, which demands planning and adaptation to dynamic outcomes. Existing physical reasoning methods often depend on abstract symbolic inputs or lack the ability to learn and adapt from direct, pixel-based visual interaction in novel scenarios. We introduce ICPRL (In-Context Physical Reinforcement Learning), a framework inspired by In-Context Reinforcement Learning (ICRL) that empowers VLMs to acquire physical intuition and adapt their policies in-context. Our approach trains a vision-grounded policy model via multi-turn Group Relative Policy Optimization (GRPO) over diverse multi-episode interaction histories. This enables the agent to adapt strategies by conditioning on past trial-and-error sequences, without requiring any weight updates. This adaptive policy works in concert with a…
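The abstract names Group Relative Policy Optimization (GRPO) as the training signal. As a rough illustration only, the sketch below shows the group-relative advantage step commonly associated with GRPO: several rollouts are sampled for the same task, and each rollout's reward is normalized against the group's mean and standard deviation. This is not the paper's multi-turn variant, whose details are not given here. The function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its sampling group.

    Hypothetical helper: A_i = (r_i - mean(r)) / (std(r) + eps).
    The population standard deviation is an assumption; implementations
    differ on this detail.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards for four rollouts of one task (1.0 = solved, 0.0 = failed).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts receive positive advantages and failed ones negative, so the policy update favors behavior that beats the group average without needing a separately learned value baseline.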
Peer Reviews
Decision · Submitted to ICLR 2026
- The idea of using a world model to enhance the test-time performance of policies is intuitive.
- The paper provides extensive evaluations across different model backbones, training algorithms, and datasets.
- The novelty of the paper is limited. The core idea of using a world model at test time to enhance policy performance is not new and has been explored in prior work such as [1][2].
- Using text as the backbone for embodied tasks, where actions are naturally continuous, seems contrived. The paper lacks discussion on why VLMs are the best backbone for both the policy and the world model.
- The world model is task-specific as it also predicts success/failure of a specific task. What’s the benefi
- Develops a comprehensive system for learning a VLM on discrete action data.
- The method appears to improve performance with multiple retries.
- The experiments are on challenging domains and get decent results.
- The writing is highly informal, which makes it hard to follow exactly what is being done.
- The claims of the paper are not particularly well supported by either the method or the experiments.
* Valid and Sound Idea: The paper presents a valid and logical approach. Combining the principles of In-Context Reinforcement Learning (ICRL) with an explicit, learned world model is a sensible strategy for tackling complex, interactive physical reasoning tasks that VLMs currently struggle with.
* Effective Paradigm Extension: The framework successfully extends the well-established planning and learning paradigm to the VLM domain. The architecture, which integrates a policy prior, a value/outcom
* Marginal Novelty over Existing Paradigms: The primary weakness is that the framework's architecture is fundamentally a straightforward extension of the MuZero algorithm, adapted for the VLM setup. MuZero also combines a learned policy prior and a world model to guide a Monte Carlo Tree Search (MCTS, of which PUCT is a variant). While applying this to VLMs is a good engineering contribution, the paper does not offer new scientific insight into the underlying principles of planning or model-base
Taxonomy
Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Social Robot Interaction and HRI
