Videos are Sample-Efficient Supervisions: Behavior Cloning from Videos via Latent Representations

Xin Liu; Haoran Li; Dongbin Zhao

arXiv:2512.21586 · cs.LG · December 29, 2025

TL;DR

This paper introduces BCV-LR, a novel unsupervised framework that enables highly sample-efficient imitation learning from videos by extracting and aligning latent action representations, outperforming existing methods in visual control tasks.

Contribution

The paper presents the first method to achieve sample-efficient visual policy learning directly from videos without any additional supervision, using latent representations and iterative policy refinement.

Findings

01. Outperforms state-of-the-art ILV and RL methods in sample efficiency.

02. Enables expert-level performance on some visual control tasks.

03. Demonstrates that videos alone can support highly efficient policy learning.

Abstract

Humans can efficiently extract knowledge and learn skills from videos with only a few rounds of trial and error. Replicating this learning process in autonomous agents is challenging, however, because of the complexity of visual input, the absence of action or reward signals, and the limited number of interaction steps. In this paper, we propose a novel, unsupervised, and sample-efficient framework for imitation learning from videos (ILV), named Behavior Cloning from Videos via Latent Representations (BCV-LR). BCV-LR extracts action-related latent features from high-dimensional video inputs through self-supervised tasks, and then leverages a dynamics-based unsupervised objective to predict latent actions between consecutive frames. The pre-trained latent actions are fine-tuned and efficiently aligned to the real action space online (with collected interactions) for policy…
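
The following is a minimal sketch of the general idea the abstract describes, not the authors' implementation. It shows a dynamics-based objective that infers a latent action from two consecutive frame features, and a simple alignment step that maps latent actions to real actions using a few labelled interactions. Everything here is an assumption made for illustration: the linear models, the dimensions, the random stand-in data, the least-squares alignment, and all names (`infer_latent_action`, `dynamics_loss`, `fit_action_decoder`). The abstract does not specify BCV-LR's architecture or losses at this level of detail.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
FEAT_DIM, LATENT_ACT_DIM, REAL_ACT_DIM = 8, 3, 2

def init_params(rng):
    """Random linear weights for two hypothetical models."""
    return {
        # Inverse model: (z_t, z_next) -> latent action.
        "W_inv": rng.normal(scale=0.1, size=(2 * FEAT_DIM, LATENT_ACT_DIM)),
        # Forward model: (z_t, latent action) -> predicted z_next.
        "W_fwd": rng.normal(scale=0.1, size=(FEAT_DIM + LATENT_ACT_DIM, FEAT_DIM)),
    }

def infer_latent_action(params, z_t, z_next):
    """Infer a latent action from the features of two consecutive frames."""
    pair = np.concatenate([z_t, z_next], axis=-1)
    return np.tanh(pair @ params["W_inv"])

def dynamics_loss(params, z_t, z_next):
    """Mean squared error of predicting z_next from z_t and the latent action.

    No action labels are used, so this can be computed on video alone.
    """
    a_lat = infer_latent_action(params, z_t, z_next)
    z_pred = np.concatenate([z_t, a_lat], axis=-1) @ params["W_fwd"]
    return float(np.mean((z_pred - z_next) ** 2))

def fit_action_decoder(a_lat, a_real):
    """Least-squares linear map from latent actions to real actions."""
    W_dec, *_ = np.linalg.lstsq(a_lat, a_real, rcond=None)
    return W_dec

# Stand-in "video" features: 64 consecutive frame pairs (random data).
z_t = rng.normal(size=(64, FEAT_DIM))
z_next = z_t + 0.1 * rng.normal(size=(64, FEAT_DIM))

params = init_params(rng)
loss = dynamics_loss(params, z_t, z_next)

# Stand-in "online interactions": 16 transitions with known real actions.
a_real = rng.normal(size=(16, REAL_ACT_DIM))
a_lat = infer_latent_action(params, z_t[:16], z_next[:16])
W_dec = fit_action_decoder(a_lat, a_real)

# A policy could then be trained by behavior cloning on decoded actions.
decoded_actions = infer_latent_action(params, z_t, z_next) @ W_dec
print(loss, W_dec.shape, decoded_actions.shape)
```

In this sketch the dynamics loss needs only frame features, while the decoder is the only part that consumes real actions. That split is why a small number of environment interactions could in principle be enough to ground latent actions learned from video.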

Videos

Videos are Sample-Efficient Supervisions: Behavior Cloning from Videos via Latent Representations · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Human Pose and Action Recognition