ICPRL: Acquiring Physical Intuition from Interactive Control

Xinrun Xu; Pi Bu; Ye Wang; Börje F. Karlsson; Ziming Wang; Tengtao Song; Qi Zhu; Jun Song; Shuo Zhang; Zhiming Ding; Bo Zheng

arXiv:2603.13295·cs.LG·March 17, 2026

Open Access·3 Reviews

TL;DR

ICPRL enables vision-language models to acquire physical intuition and adapt policies through interactive, in-context learning without weight updates, improving performance on physics-based tasks in unseen environments.

Contribution

We propose ICPRL, a novel framework combining in-context reinforcement learning with visual models and explicit physical reasoning, allowing adaptive policy learning from interaction histories.

Findings

01

Significant performance improvements on DeepPHY physics tasks.

02

Effective adaptation in unseen physical environments.

03

Policy and world model collaboration enhances decision-making.

Abstract

VLMs excel at static perception but falter in interactive reasoning in dynamic physical environments, which demands planning and adaptation to dynamic outcomes. Existing physical reasoning methods often depend on abstract symbolic inputs or lack the ability to learn and adapt from direct, pixel-based visual interaction in novel scenarios. We introduce ICPRL (In-Context Physical Reinforcement Learning), a framework inspired by In-Context Reinforcement Learning (ICRL) that empowers VLMs to acquire physical intuition and adapt their policies in-context. Our approach trains a vision-grounded policy model via multi-turn Group Relative Policy Optimization (GRPO) over diverse multi-episode interaction histories. This enables the agent to adapt strategies by conditioning on past trial-and-error sequences, without requiring any weight updates. This adaptive policy works in concert with a…
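As a rough illustration of the group-relative signal that GRPO is built on, the sketch below normalizes each rollout's reward against the other rollouts sampled in the same group. It is only a minimal sketch: the function name `group_relative_advantages`, the `eps` term, the use of the population standard deviation, and the example rewards are all assumptions made for illustration, and the paper's multi-turn variant over multi-episode histories may differ in its details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each rollout relative to the other rollouts in its group.

    Assumption: advantages are (reward - group mean) / (group std + eps),
    using the population standard deviation. Implementations vary here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical binary success rewards for four rollouts that were all
# sampled from the same interaction history.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts receive positive advantages and failed ones negative, so no separate learned value function is needed for the baseline. When every rollout in a group receives the same reward, all advantages are zero and that group contributes no learning signal.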

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 2·Confidence 4

Strengths

- The idea of using a world model to enhance the test-time performance of policies is intuitive.
- The paper provides extensive evaluations across different model backbones, training algorithms, and datasets.

Weaknesses

- The novelty of the paper is limited. The core idea of using a world model at test time to enhance policy performance is not new and has been explored in prior work such as [1][2].
- Using text as the backbone for embodied tasks, where actions are naturally continuous, seems contrived. The paper lacks discussion on why VLMs are the best backbone for both the policy and the world model.
- The world model is task-specific as it also predicts success/failure of a specific task. What’s the benefi

Reviewer 02·Rating 6·Confidence 3

Strengths

Develops a comprehensive system for learning a VLM on discrete action data. The method appears to improve performance with multiple retries. The experiments are on challenging domains and get decent results.

Weaknesses

The writing is highly informal, which makes it hard to follow exactly what is being done. The claims of the paper are not particularly well supported by either the method or the experiments.

Reviewer 03·Rating 4·Confidence 4

Strengths

* Valid and Sound Idea: The paper presents a valid and logical approach. Combining the principles of In-Context Reinforcement Learning (ICRL) with an explicit, learned world model is a sensible strategy for tackling complex, interactive physical reasoning tasks that VLMs currently struggle with.
* Effective Paradigm Extension: The framework successfully extends the well-established planning and learning paradigm to the VLM domain. The architecture, which integrates a policy prior, a value/outcom

Weaknesses

* Marginal Novelty over Existing Paradigms: The primary weakness is that the framework's architecture is fundamentally a straightforward extension of the MuZero algorithm, adapted for the VLM setup. MuZero also combines a learned policy prior and a world model to guide a Monte Carlo Tree Search (MCTS, of which PUCT is a variant). While applying this to VLMs is a good engineering contribution, the paper does not offer new scientific insight into the underlying principles of planning or model-base

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Social Robot Interaction and HRI