Language-Driven Representation Learning for Robotics
Siddharth Karamcheti, Suraj Nair, Annie S. Chen, Thomas Kollar, Chelsea Finn, Dorsa Sadigh, Percy Liang

TL;DR
This paper introduces Voltron, a novel language-driven representation learning framework that leverages human videos and captions to improve visual representations for diverse robotic tasks, outperforming existing methods.
Contribution
The paper presents Voltron, a new framework combining visual reconstruction and language grounding, and provides a comprehensive evaluation suite for robotic visual representations.
Findings
Voltron outperforms prior state-of-the-art methods across five robotic tasks.
Language-driven features improve high-level semantic understanding in robotic perception.
Existing methods show inconsistent results across different robotic vision tasks.
Abstract
Recent work in visual representation learning for robotics demonstrates the viability of learning from large video datasets of humans performing everyday tasks. Leveraging methods such as masked autoencoding and contrastive learning, these representations exhibit strong transfer to policy learning for visuomotor control. But robot learning encompasses a diverse set of problems beyond control, including grasp affordance prediction, language-conditioned imitation learning, and intent scoring for human-robot collaboration, amongst others. First, we demonstrate that existing representations yield inconsistent results across these tasks: masked autoencoding approaches pick up on low-level spatial features at the cost of high-level semantics, while contrastive learning approaches capture the opposite. We then introduce Voltron, a framework for language-driven representation learning from…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
Methods: Contrastive Learning
