ICLR: In-Context Imitation Learning with Visual Reasoning
Toan Nguyen, Weiduo Yuan, Songlin Wei, Hui Li, Daniel Seita, Yue Wang

TL;DR
ICLR introduces a novel in-context imitation learning framework that incorporates visual reasoning traces to improve robot task adaptation, success rates, and generalization in complex scenarios.
Contribution
The paper presents a unified autoregressive transformer that jointly learns to generate visual reasoning traces and low-level actions, so the model imitates both the actions and the reasoning process that leads to them.
Findings
Improved success rates in manipulation tasks.
Enhanced generalization to unseen tasks and objects.
Effective integration of visual reasoning in imitation learning.
Abstract
In-context imitation learning enables robots to adapt to new tasks from a small number of demonstrations without additional training. However, existing approaches typically condition only on state-action trajectories and lack explicit representations of task intent. This limitation hinders performance in complex and ambiguous task settings where the same actions may be consistent with different objectives. To address this, we present In-Context Imitation Learning with Visual Reasoning (ICLR), a novel framework that augments demonstration prompts with structured visual reasoning traces representing anticipated future robot trajectories in image space. ICLR also jointly learns to generate reasoning traces and low-level actions within a unified autoregressive transformer, enabling the model to mimic not only action prediction but also the reasoning process that leads to those actions. We…
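To make the idea of a trace-augmented demonstration prompt concrete, here is a minimal sketch of how such a prompt might be assembled. It is not the authors' implementation. The `Step` container, the `build_prompt` helper, the string placeholders for observations, and the per-step ordering (observation, then trace, then action) are all assumptions made for illustration; the abstract only states that traces represent anticipated future robot trajectories in image space and that traces and actions are generated by one autoregressive model.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[int, int]  # (u, v) pixel coordinates in image space


@dataclass
class Step:
    """One demonstration timestep (hypothetical container, not from the paper)."""
    observation: str     # stand-in for an image or state embedding
    trace: List[Point]   # anticipated future robot trajectory, in pixels
    action: List[float]  # low-level action, e.g. a delta end-effector pose


def build_prompt(demos: List[List[Step]], query_obs: str) -> List[Tuple[str, object]]:
    """Flatten demonstrations into one sequence of (kind, payload) items.

    Assumed ordering per step: observation -> trace -> action, so that the
    trace precedes the action it is meant to explain. The query observation
    is appended last; an autoregressive model would continue the sequence
    from there by emitting a trace and then an action.
    """
    sequence: List[Tuple[str, object]] = []
    for demo in demos:
        for step in demo:
            sequence.append(("obs", step.observation))
            sequence.append(("trace", step.trace))
            sequence.append(("action", step.action))
    sequence.append(("obs", query_obs))
    return sequence


# Usage: one demonstration with two timesteps, followed by a new query observation.
demo = [
    Step("img_0", [(10, 20), (15, 25)], [0.01, 0.00, -0.02]),
    Step("img_1", [(15, 25), (22, 31)], [0.02, 0.01, -0.01]),
]
prompt = build_prompt([demo], query_obs="img_query")
kinds = [kind for kind, _ in prompt]
```

Under these assumptions, `kinds` alternates `obs`, `trace`, `action` for each demonstration step and ends with the query `obs`, which is where generation of the next trace and action would begin.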
Taxonomy
Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics
