OSVI-WM: One-Shot Visual Imitation for Unseen Tasks using World-Model-Guided Trajectory Generation

Raktim Gautam Goswami; Prashanth Krishnamurthy; Yann LeCun; Farshad Khorrami

arXiv:2505.20425 · cs.RO · January 1, 2026

TL;DR

This paper introduces OSVI-WM, a one-shot visual imitation learning framework that uses a world model to generate trajectories, enabling robots to perform unseen tasks with higher success rates than prior methods in both simulated and real-world environments.

Contribution

The paper presents a world-model-guided trajectory generation approach for one-shot visual imitation, improving generalization to unseen tasks and outperforming prior methods.

Findings

01. Over 30% success rate improvement on benchmarks

02. Effective in both simulated and real-world robotic tasks

03. Outperforms existing approaches in generalization to new tasks

Abstract

Visual imitation learning enables robotic agents to acquire skills by observing expert demonstration videos. In the one-shot setting, the agent generates a policy after observing a single expert demonstration without additional fine-tuning. Existing approaches typically train and evaluate on the same set of tasks, varying only object configurations, and struggle to generalize to unseen tasks with different semantic or structural requirements. While some recent methods attempt to address this, they exhibit low success rates on hard test tasks that, despite being visually similar to some training tasks, differ in context and require distinct responses. Additionally, most existing methods lack an explicit model of environment dynamics, limiting their ability to reason about future states. To address these limitations, we propose a novel framework for one-shot visual imitation learning via…

Taxonomy

Topics: Human Motion and Animation · Human Pose and Action Recognition · Multimodal Machine Learning Applications