Learning to Act from Actionless Videos through Dense Correspondences
Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, Joshua B. Tenenbaum

TL;DR
This paper introduces a method to learn robot policies from videos without action labels by synthesizing videos and using dense correspondences, enabling cross-robot and cross-environment generalization.
Contribution
It presents a novel approach that constructs robot policies solely from RGB videos and text, eliminating the need for action annotations and enabling broad applicability.
Findings
Effective policy learning from videos without action labels
Successful deployment across diverse robotic tasks
Open-source framework for fast high-fidelity video modeling
Abstract
In this work, we present an approach to construct a video-based robot policy capable of reliably executing diverse tasks across different robots and environments from few video demonstrations without using any action annotations. Our method leverages images as a task-agnostic representation, encoding both the state and action information, and text as a general representation for specifying robot goals. By synthesizing videos that "hallucinate" a robot executing actions, in combination with dense correspondences between frames, our approach can infer the closed-form action to execute in an environment without the need for any explicit action labels. This unique capability allows us to train the policy solely based on RGB videos and deploy learned policies to various robotic tasks. We demonstrate the efficacy of our approach in learning policies on table-top manipulation and…
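To make the idea of recovering an action from dense correspondences concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation. It assumes that a flow field between the current frame and a synthesized future frame is already available, that depth is known at both ends of each correspondence, and that the tracked pixels belong to a single rigid object. Under those assumptions it lifts the matched pixels to 3D and fits a rigid transform with the standard Kabsch (SVD) solution. The function names `backproject`, `kabsch`, and `rigid_transform_from_flow`, and the intrinsics matrix `K`, are illustrative choices rather than names taken from the paper.

```python
import numpy as np


def backproject(pixels, depth, K):
    """Lift (N, 2) pixel coordinates (u, v) with (N,) depths to (N, 3) camera-frame points."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (pixels[:, 0] - cx) * depth / fx
    y = (pixels[:, 1] - cy) * depth / fy
    return np.stack([x, y, depth], axis=1)


def kabsch(P, Q):
    """Least-squares rigid transform (R, t) such that Q is close to P @ R.T + t."""
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets.
    H = (P - p_mean).T @ (Q - q_mean)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean
    return R, t


def rigid_transform_from_flow(pixels, flow, depth_src, depth_dst, K):
    """Estimate the rigid motion of tracked pixels between two frames.

    pixels:    (N, 2) pixel coordinates in the current frame
    flow:      (N, 2) displacement of each pixel toward the future frame
    depth_src: (N,) depth at `pixels` (assumed known)
    depth_dst: (N,) depth at `pixels + flow` (assumed known; a simplification)
    """
    P = backproject(pixels, depth_src, K)
    Q = backproject(pixels + flow, depth_dst, K)
    return kabsch(P, Q)
```

The returned rotation `R` and translation `t` describe how the object should move between the two frames; a controller could then turn that motion into an end-effector target. Requiring depth at the future frame is a simplification made here for brevity, since a synthesized RGB frame would not normally come with depth.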
Peer Reviews
Decision: ICLR 2024 spotlight
- The paper tackles an important problem of learning from action-free videos
- The method, to my knowledge, is novel
- The approach significantly outperforms baselines on many different tasks
- The ablations are well analyzed
- The paper is easy to follow and well written
- I think one of the main limitations is the setting: AVDC needs videos of robots performing the task. I believe this is a contrived setting as it is very likely that if video + 3D information is available, then this was a robot demonstration, and one can just collect action data. To me, it is unclear how this approach will scale beyond robot data.
- I am concerned by the reported results for the BC baseline. Due to action data being available, as well as the robot data being in-domain for the
* The general problem of making use of actionless human video data is of interest and importance to the research community.
* The problem is well-motivated and the literature review does a good job of contextualizing the paper in prior work.
* The paper is strong, well-written and easy to follow.
* The use of geometry to reconstruct the transformation of the predicted objects (stationary camera) or embodiment (moving camera) which can be derived simultaneously from the optical flow and depth cam
* The literature review is missing a number of relevant works.
  * V-PTR: similar high-level motivation of using video-based, prediction-focused pre-training and then action-based finetuning. This likely should have served as a baseline for the proposed method.
    * [A] Bhateja, Chethan, et al. "Robotic Offline RL from Internet Videos via Value-Function Pre-Training." arXiv preprint arXiv:2309.13041 (2023).
  * Diffusion policy: diffusion policy has shown very good results in terms of multi-task
1. The paper proposes a new correspondence-based method to obtain robot actions from forecasted robot videos. It shows that a latent dynamics model is not needed if the forecasted video is of good quality.
2. The authors propose a new method to generate future videos using a diffusion model, which achieves efficient training. It provides a promising toolbox for the community.
3. The method is evaluated on two tasks, table-top manipulation and indoor navigation, demonstrating its effectiveness in dif
1. The selected robot tasks are relatively toy-like, and the full potential of this kind of video prediction method is not evaluated. However, this is not a weakness specific to this paper, but a common practice for video-prediction-based robot control.
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Reinforcement Learning in Robotics
