AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning

Dejie Yang; Zijing Zhao; Yang Liu

arXiv:2508.07626·cs.CV·August 12, 2025

Open Access

TL;DR

AR-VRM introduces a novel approach for visual robot manipulation by explicitly imitating human actions through analogical reasoning, leveraging large-scale human video data to improve generalization and performance in data-scarce scenarios.

Contribution

The paper presents a new method that explicitly learns from human action videos and uses analogical reasoning to improve robot manipulation, addressing data scarcity and generalization issues.

Findings

01. Achieves leading performance on the CALVIN benchmark.

02. Outperforms previous methods significantly in few-shot scenarios.

03. Effectively imitates human actions for robotic manipulation.

Abstract

Visual Robot Manipulation (VRM) aims to enable a robot to follow natural language instructions based on robot states and visual observations, and therefore requires costly multi-modal data. To compensate for the deficiency of robot data, existing approaches have employed vision-language pretraining with large-scale data. However, they either utilize web data that differs from robotic tasks, or train the model in an implicit way (e.g., predicting future frames at the pixel level), thus showing limited generalization ability under insufficient robot data. In this paper, we propose to learn from large-scale human action video datasets in an explicit way (i.e., imitating human actions from hand keypoints), introducing Visual Robot Manipulation with Analogical Reasoning (AR-VRM). To acquire action knowledge explicitly from human action videos, we propose a keypoint Vision-Language Model…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Robot Manipulation and Learning