LaVA-Man: Learning Visual Action Representations for Robot Manipulation

Chaoran Zhu; Hengyi Wang; Yik Lung Pang; Changjae Oh

arXiv:2508.19391 · cs.RO · September 30, 2025

TL;DR

This paper introduces LaVA-Man, a self-supervised approach to learning visual-action representations for robot manipulation. It also introduces the Omni-Object Pick-and-Place dataset and outperforms previous methods on five benchmarks.

Contribution

The paper proposes a novel self-supervised pretext task for learning visual-textual associations without robot action supervision, and introduces the Omni-Object Pick-and-Place dataset for comprehensive evaluation.

Findings

01 Outperforms prior art on five benchmarks

02 Effective in both simulated and real-robot settings

03 Learns diverse object priors for manipulation

Abstract

Visual-textual understanding is essential for language-guided robot manipulation. Recent works leverage pre-trained vision-language models to measure the similarity between encoded visual observations and textual instructions, and then train a model to map this similarity to robot actions. However, this two-step approach limits the model's ability to capture the relationship between visual observations and textual instructions, leading to reduced precision in manipulation tasks. We propose to learn visual-textual associations through a self-supervised pretext task: reconstructing a masked goal image conditioned on an input image and textual instructions. This formulation allows the model to learn visual-action representations without robot action supervision. The learned representations can then be fine-tuned for manipulation tasks with only a few demonstrations. We also introduce the…
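
The pretext task described in the abstract (reconstructing a masked goal image given an input image and an instruction) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the patch size, the 0.75 mask ratio, the mean-squared-error loss on masked patches, and every function name below are assumptions chosen for illustration, and `predict_goal_patches` is a hypothetical stand-in that ignores the instruction and simply copies the input image where the goal is hidden.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask over patches; True marks a patch hidden from the model."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

def predict_goal_patches(input_patches, instruction, goal_patches, mask):
    """Hypothetical stand-in for a learned model.

    A real model would encode the input image, the instruction, and the
    visible goal patches. This baseline ignores the instruction and fills
    every masked goal patch with the corresponding input-image patch.
    """
    return np.where(mask[:, None], input_patches, goal_patches)

def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error computed over the masked patches only (assumed loss)."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return float(per_patch[mask].mean())

rng = np.random.default_rng(0)

# Synthetic data: the goal image is the input image with two 8x8 regions swapped,
# loosely mimicking an object that has been moved.
input_image = rng.random((32, 32, 3))
goal_image = input_image.copy()
goal_image[0:8, 0:8] = input_image[24:32, 24:32]
goal_image[24:32, 24:32] = input_image[0:8, 0:8]
instruction = "move the object to the opposite corner"  # illustrative text only

input_patches = patchify(input_image, patch=8)
goal_patches = patchify(goal_image, patch=8)
mask = random_mask(len(goal_patches), mask_ratio=0.75, rng=rng)

pred = predict_goal_patches(input_patches, instruction, goal_patches, mask)
loss = masked_reconstruction_loss(pred, goal_patches, mask)
print(f"masked patches: {mask.sum()} of {mask.size}, loss: {loss:.4f}")
```

The sketch uses no robot actions anywhere in the training signal: the only supervision is the goal image itself, which is the property the abstract highlights. The copy-the-input baseline is exact wherever the scene is unchanged, so any remaining loss comes from masked patches where the goal differs from the input.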

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Human Pose and Action Recognition