Videos are Sample-Efficient Supervisions: Behavior Cloning from Videos via Latent Representations
Xin Liu, Haoran Li, Dongbin Zhao

TL;DR
This paper introduces BCV-LR, a novel unsupervised framework that enables highly sample-efficient imitation learning from videos by extracting and aligning latent action representations, outperforming existing methods in visual control tasks.
Contribution
The paper presents the first method to achieve sample-efficient visual policy learning directly from videos without any additional supervision, using latent representations and iterative policy refinement.
Findings
Outperforms state-of-the-art imitation-learning-from-videos (ILV) and reinforcement learning (RL) methods in sample efficiency.
Enables expert-level performance on some visual control tasks.
Demonstrates that videos alone can support highly efficient policy learning.
Abstract
Humans can efficiently extract knowledge and learn skills from videos with only a few rounds of trial and error. Replicating this learning process in autonomous agents, however, remains a major challenge because of the complexity of visual input, the absence of action or reward signals, and the limited number of interaction steps available. In this paper, we propose a novel, unsupervised, and sample-efficient framework for imitation learning from videos (ILV), named Behavior Cloning from Videos via Latent Representations (BCV-LR). BCV-LR extracts action-related latent features from high-dimensional video inputs through self-supervised tasks, and then leverages a dynamics-based unsupervised objective to predict latent actions between consecutive frames. The pre-trained latent actions are fine-tuned and efficiently aligned to the real action space online (with collected interactions) for policy…
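The two stages the abstract describes (predicting latent actions between consecutive frames with a dynamics-based objective, then aligning those latent actions to real actions) can be illustrated with a toy sketch. This is not the BCV-LR implementation: the linear models, the dimensions, the random stand-in features, and the function names `infer_latent_action`, `predict_next`, `dynamics_loss`, and `fit_action_decoder` are all assumptions made for illustration, and the least-squares alignment is a simple stand-in for the paper's online fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: frame-feature, latent-action, and real-action dimensions.
FEAT_DIM, LATENT_DIM, ACTION_DIM = 8, 3, 2


def infer_latent_action(W_inv, z_t, z_next):
    """Toy inverse model: map a pair of consecutive frame features to a latent action."""
    return np.tanh(np.concatenate([z_t, z_next], axis=-1) @ W_inv)


def predict_next(W_fwd, z_t, latent_a):
    """Toy forward model: predict the next frame feature from the current one and a latent action."""
    return np.concatenate([z_t, latent_a], axis=-1) @ W_fwd


def dynamics_loss(W_inv, W_fwd, z_t, z_next):
    """Unsupervised objective: the latent action must explain the observed transition."""
    latent_a = infer_latent_action(W_inv, z_t, z_next)
    pred = predict_next(W_fwd, z_t, latent_a)
    return float(np.mean((pred - z_next) ** 2))


def fit_action_decoder(latent_a, real_a):
    """Align latent actions to real actions with a least-squares linear map (illustrative stand-in)."""
    W_dec, *_ = np.linalg.lstsq(latent_a, real_a, rcond=None)
    return W_dec


# Random features stand in for encoded video frames (action-free data).
z_t = rng.normal(size=(16, FEAT_DIM))
z_next = rng.normal(size=(16, FEAT_DIM))
W_inv = rng.normal(scale=0.1, size=(2 * FEAT_DIM, LATENT_DIM))
W_fwd = rng.normal(scale=0.1, size=(FEAT_DIM + LATENT_DIM, FEAT_DIM))
loss = dynamics_loss(W_inv, W_fwd, z_t, z_next)

# A few synthetic "interaction" samples with known real actions for the alignment step.
latent_a = infer_latent_action(W_inv, z_t, z_next)
true_map = rng.normal(size=(LATENT_DIM, ACTION_DIM))
real_a = latent_a @ true_map
W_dec = fit_action_decoder(latent_a, real_a)
```

In this sketch the first stage needs only frame pairs, while the alignment step needs a small number of transitions with real actions, which mirrors the abstract's split between video pre-training and online interaction. A real system would train both models by gradient descent rather than evaluating the loss on random weights.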
Taxonomy
Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Human Pose and Action Recognition
