Grounding Video Models to Actions through Goal Conditioned Exploration
Yunhao Luo, Yilun Du

TL;DR
This paper introduces a method for grounding large video models to continuous actions through self-exploration, enabling agents to learn complex tasks without external supervision or action labels.
Contribution
The paper presents a novel framework that combines trajectory-level action generation with video guidance to directly connect video models to embodied actions without needing labeled data.
Findings
Achieved comparable or better performance than behavior cloning baselines.
Successfully applied to multiple tasks across various environments.
Eliminated the need for external supervision like rewards or action annotations.
Abstract
Large video models, pretrained on massive amounts of Internet video, provide a rich source of physical knowledge about the dynamics and motions of objects and tasks. However, video models are not grounded in the embodiment of an agent, and do not describe how to actuate the world to reach the visual states depicted in a video. To tackle this problem, current methods use a separate vision-based inverse dynamic model trained on embodiment-specific data to map image states to actions. Gathering data to train such a model is often expensive and challenging, and this model is limited to visual settings similar to the ones in which data are available. In this paper, we investigate how to directly ground video models to continuous actions through self-exploration in the embodied environment -- using generated video states as visual goals for exploration. We propose a framework that uses…
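The abstract's central idea, using generated video states as visual goals for exploration, can be illustrated with a toy sketch. This is not the authors' implementation: the 2D point environment, the `video_goals` stub (linear interpolation standing in for a video model), the linear least-squares policy, and the relabeling of each transition with the state it actually reached are all assumptions made for illustration. The random bootstrapping step corresponds to the "random action bootstrapping" that the reviews below discuss.

```python
import numpy as np

rng = np.random.default_rng(0)

def env_step(state, action):
    # Toy dynamics (assumed, unknown to the agent): move by half the action.
    return state + 0.5 * action

def video_goals(start, target, horizon=8):
    # Hypothetical stand-in for a video model: a sequence of goal states
    # interpolated between the start and the target.
    alphas = np.linspace(0.0, 1.0, horizon + 1)[1:, None]
    return start + alphas * (target - start)

def fit_policy(states, goals, actions):
    # Linear goal-conditioned policy a = [s, g] @ W, fit by least squares.
    features = np.hstack([states, goals])
    weights, *_ = np.linalg.lstsq(features, actions, rcond=None)
    return weights

def act(weights, state, goal):
    return np.concatenate([state, goal]) @ weights

# 1) Random action bootstrapping: no rewards and no demonstration actions.
states = rng.uniform(-1, 1, size=(256, 2))
actions = rng.uniform(-1, 1, size=(256, 2))
reached = env_step(states, actions)
buf_s, buf_a, buf_g = [states], [actions], [reached]
# Each transition is relabeled with the state it actually reached as its goal.
weights = fit_policy(states, reached, actions)

# 2) Video-guided exploration: follow generated goals with noise, then refit.
for _ in range(3):
    start, target = rng.uniform(-1, 1, size=(2, 2))
    state = start
    for goal in video_goals(start, target):
        action = act(weights, state, goal) + 0.1 * rng.normal(size=2)
        next_state = env_step(state, action)
        buf_s.append(state[None])
        buf_a.append(action[None])
        buf_g.append(next_state[None])
        state = next_state
    weights = fit_policy(np.vstack(buf_s), np.vstack(buf_g), np.vstack(buf_a))

# 3) Evaluation: execute the policy along the goal sequence for a new task.
def rollout(weights, start, target):
    state = start.copy()
    for goal in video_goals(start, target):
        state = env_step(state, act(weights, state, goal))
    return state

target = np.array([0.8, -0.4])
final_state = rollout(weights, np.zeros(2), target)
print("distance to target:", np.linalg.norm(final_state - target))
```

Because the toy dynamics are linear and noise-free, the relabeled data determine the policy exactly. In a realistic setting with image observations and a learned video model, the policy would be a neural network and exploration quality would matter far more.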
Peer Reviews
Decision·ICLR 2025 Spotlight
1. The unsupervised grounding of video models to actions eliminates the dependency on expensive action annotations, efficiently addressing the problem of mapping video-based observations to actionable policies. 2. The proposed method achieves strong performance across multiple evaluation environments, outperforming supervised methods in quantitative and qualitative results. In addition, the method shows adaptability across different domains, from robotic manipulation to visual navigation, demonstra
This paper is well-written, with strong motivation and comprehensive evaluation results. I have only some minor weaknesses and questions. 1. The reliance on random exploration may not achieve high performance in tasks requiring high precision, such as tasks involving fine-grained manipulation or exact positioning. This approach may struggle to find optimal actions in environments where precise control is crucial, limiting its applications. 2. The proposed method relies heavily on the q
The paper is clearly written and easy to follow. The proposed method achieves state-of-the-art results on several commonly used datasets with consistent improvement. The ablation study shows the effect of each designed module.
One concern about the proposed method lies in the selection of video guidance. As assumed in this paper, a good enough conditional video generative model is key to improving the goal-conditioned policy in the iterative refining process. This raises questions about the availability of such models for specific tasks with limited demonstrations. Though the authors mention that leveraging pre-trained text-to-video models could be discussed in the future, it seems necessary even at the current scope (or
The paper has several strengths including: 1. The work is well-motivated, well-written, and clear 2. The method is unique in proposing video models as a way to enable strong goal conditioning for the policy without needing action labels. 3. The idea of self-learning is unique compared to current RL approaches 4. Good set of ablations performed in the main paper and supplement to analyze the method well 5. The authors sufficiently addressed most of the limitations of their work. Despite these l
Some weaknesses of the paper include: 1. There is a lot of confusion around random action bootstrapping. Mainly, I am confused if it is possible to learn a policy without any successful actions (assuming random action bootstrapping does not result in any successful trajectories). Please see the first 6 points under ‘questions’ to know what needs to be added to the paper to address this weakness. 2. The cost of training this kind of model compared to BC is not mentioned in the limitation section
Taxonomy
Topics: Educational Games and Gamification
