LIV: Language-Image Representations and Rewards for Robotic Control
Yecheng Jason Ma, William Liang, Vaidehi Som, Vikash Kumar, Amy Zhang, Osbert Bastani, Dinesh Jayaraman

TL;DR
LIV is a unified framework for learning a vision-language representation and a reward function from action-free videos with text annotations. Given only a language or image goal, the pre-trained model can assign dense rewards to video frames of unseen robots or humans in unseen environments.
Contribution
The paper presents LIV, the first control-centric vision-language representation pre-trained on large human video datasets such as EpicKitchen. Its objective exploits a connection between dual reinforcement learning and mutual information contrastive learning.
Findings
LIV outperforms prior state representations in imitation learning.
LIV improves reward specification for policy synthesis.
LIV effectively generalizes to unseen environments and tasks.
Abstract
We present Language-Image Value learning (LIV), a unified objective for vision-language representation and reward learning from action-free videos with text annotations. Exploiting a novel connection between dual reinforcement learning and mutual information contrastive learning, the LIV objective trains a multi-modal representation that implicitly encodes a universal value function for tasks specified as language or image goals. We use LIV to pre-train the first control-centric vision-language representation from large human video datasets such as EpicKitchen. Given only a language or image goal, the pre-trained LIV model can assign dense rewards to each frame in videos of unseen robots or humans attempting that task in unseen environments. Further, when some target domain-specific data is available, the same objective can be used to fine-tune and improve LIV and even other pre-trained…
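The per-frame reward assignment described in the abstract can be illustrated with a toy sketch. It assumes the reward for a frame is the cosine similarity between that frame's embedding and the goal's embedding in a shared representation space; this summary does not state the exact reward form, so treat that choice as an assumption. The function names `cosine_similarity` and `dense_rewards` are hypothetical, and the two-dimensional vectors are hand-written stand-ins rather than outputs of the LIV model.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dense_rewards(
    frame_embeddings: Sequence[Sequence[float]],
    goal_embedding: Sequence[float],
) -> List[float]:
    """Assign one reward per frame: similarity of the frame to the goal.

    Assumption: frames and the goal (language or image) are already embedded
    in the same space by some pre-trained encoder, which is not shown here.
    """
    return [cosine_similarity(frame, goal_embedding) for frame in frame_embeddings]


# Toy stand-in embeddings: the frames move progressively toward the goal.
goal = [1.0, 0.0]
frames = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
rewards = dense_rewards(frames, goal)
for t, r in enumerate(rewards):
    print(f"frame {t}: reward {r:.3f}")
```

In this toy trajectory the reward rises as the frames approach the goal, which is the behavior a dense reward signal is meant to provide for a successful attempt at the task.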
Taxonomy
Topics: Multimodal Machine Learning Applications · interferon and immune responses · Domain Adaptation and Few-Shot Learning
