LaVA-Man: Learning Visual Action Representations for Robot Manipulation
Chaoran Zhu, Hengyi Wang, Yik Lung Pang, Changjae Oh

TL;DR
This paper introduces LaVA-Man, a self-supervised approach to learning visual-action representations for robot manipulation. It also introduces a new dataset, Omni-Object Pick-and-Place, and reports outperforming prior methods on five benchmarks.
Contribution
The paper proposes a novel self-supervised pretext task for learning visual-textual associations without robot action supervision, and introduces the Omni-Object Pick-and-Place dataset for comprehensive evaluation.
Findings
Outperforms prior art on five benchmarks
Effective in both simulated and real-robot settings
Learns diverse object priors for manipulation
Abstract
Visual-textual understanding is essential for language-guided robot manipulation. Recent works leverage pre-trained vision-language models to measure the similarity between encoded visual observations and textual instructions, and then train a model to map this similarity to robot actions. However, this two-step approach limits the model's ability to capture the relationship between visual observations and textual instructions, which reduces precision in manipulation tasks. We propose to learn visual-textual associations through a self-supervised pretext task: reconstructing a masked goal image conditioned on an input image and textual instructions. This formulation allows the model to learn visual-action representations without robot action supervision. The learned representations can then be fine-tuned for manipulation tasks with only a few demonstrations. We also introduce the…
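The pretext task described in the abstract (reconstructing a masked goal image given an input image and a text instruction) can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the patch size, the 0.75 mask ratio, the helper names (`patchify`, `random_mask`, `toy_predictor`, `masked_reconstruction_loss`), the choice of mean squared error on masked patches, and the stand-in predictor, which is not the authors' network and ignores the text embedding entirely.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean vector; True marks a goal patch hidden from the model."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask


def toy_predictor(input_patches, visible_goal, mask, text_embedding):
    """Hypothetical stand-in for a learned model (NOT the paper's network).

    It keeps the visible goal patches and fills each hidden one with the
    input-image patch at the same location. text_embedding is accepted only
    to show where the instruction would enter; it is unused here.
    """
    return np.where(mask[:, None], input_patches, visible_goal)


def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error computed over the hidden goal patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return float(per_patch[mask].mean())


rng = np.random.default_rng(0)
input_image = rng.random((32, 32, 3))   # observation before the action
goal_image = rng.random((32, 32, 3))    # observation after the action
text_embedding = rng.random(16)         # placeholder instruction encoding

input_patches = patchify(input_image, patch=8)
goal_patches = patchify(goal_image, patch=8)
mask = random_mask(len(goal_patches), mask_ratio=0.75, rng=rng)

# Hidden goal patches are zeroed so the predictor cannot read them.
visible_goal = np.where(mask[:, None], 0.0, goal_patches)

pred = toy_predictor(input_patches, visible_goal, mask, text_embedding)
loss = masked_reconstruction_loss(pred, goal_patches, mask)
print(f"masked patches: {int(mask.sum())} of {len(mask)}, loss: {loss:.4f}")
```

In this sketch the training signal comes only from image pairs and text, with no action labels, which mirrors the abstract's claim that the representation is learned without robot action supervision. A real model would replace `toy_predictor` with a learned network and would presumably use the instruction to decide what the hidden regions should contain.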
Taxonomy
Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Human Pose and Action Recognition
