ViViDex: Learning Vision-based Dexterous Manipulation from Human Videos

Zerui Chen; Shizhe Chen; Etienne Arlaud; Ivan Laptev; Cordelia Schmid

arXiv:2404.15709 · cs.CV · March 4, 2025 · 2 cites

TL;DR

This paper introduces ViViDex, a framework that learns vision-based dexterous manipulation policies from human videos. It addresses two limitations of prior work, noise in estimated trajectories and reliance on privileged object information, and achieves superior performance in simulation and on a real robot.

Contribution

ViViDex combines reinforcement learning with trajectory-guided rewards and a coordinate transformation to train unified visual policies from human videos without privileged information.
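To make the idea of a trajectory-guided reward concrete, here is a minimal sketch in Python. It is not the paper's reward: the exponential form, the weights, the `scale` parameter, and the function name `trajectory_guided_reward` are all assumptions chosen for illustration. The only idea taken from the summary above is that the reward encourages the simulated hand and object to follow a reference trajectory estimated from a human video.

```python
import math

def trajectory_guided_reward(hand_pos, obj_pos, ref_hand_pos, ref_obj_pos,
                             w_hand=1.0, w_obj=1.0, scale=10.0):
    """Hypothetical tracking reward (illustrative, not the authors' formulation).

    Each term decays exponentially with the Euclidean distance between the
    simulated position and the reference position at the same timestep.
    """
    hand_err = math.dist(hand_pos, ref_hand_pos)  # hand tracking error
    obj_err = math.dist(obj_pos, ref_obj_pos)     # object tracking error
    return w_hand * math.exp(-scale * hand_err) + w_obj * math.exp(-scale * obj_err)

# Reference positions for one timestep (made-up values, in meters).
ref_hand = (0.10, 0.00, 0.20)
ref_obj = (0.12, 0.00, 0.15)

# Perfect tracking yields the maximum reward, w_hand + w_obj.
perfect = trajectory_guided_reward(ref_hand, ref_obj, ref_hand, ref_obj)

# A hand that drifts 5 cm from the reference earns a lower reward.
drifted = trajectory_guided_reward((0.15, 0.00, 0.20), ref_obj, ref_hand, ref_obj)
```

A smooth, bounded term like this gives the policy a dense signal at every timestep. A real system would likely also track orientations and finger joints and add task-specific terms, none of which are modeled here.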

Findings

01 Outperforms state-of-the-art methods in three manipulation tasks

02 Effective in both simulation and real robot experiments

03 Improves visual policy learning from noisy human videos

Abstract

In this work, we aim to learn a unified vision-based policy for multi-fingered robot hands to manipulate a variety of objects in diverse poses. Though prior work has shown benefits of using human videos for policy learning, performance gains have been limited by the noise in estimated trajectories. Moreover, reliance on privileged object information such as ground-truth object states further limits the applicability in realistic scenarios. To address these limitations, we propose a new framework, ViViDex, to improve vision-based policy learning from human videos. It first uses reinforcement learning with trajectory-guided rewards to train state-based policies for each video, obtaining both visually natural and physically plausible trajectories from the video. We then roll out successful episodes from state-based policies and train a unified visual policy without using any privileged…

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics

Methods: Diffusion