Human-to-Robot Interaction: Learning from Video Demonstration for Robot Imitation
Thanh Nguyen Canh, Thanh-Tuan Tran, Haolan Zhang, Ziyan Gao, Nak Young Chong, Xiem HoangVan

TL;DR
This paper introduces a modular imitation learning framework enabling robots to learn manipulation skills directly from unstructured video demonstrations, combining visual understanding with reinforcement learning for improved generalization and accuracy.
Contribution
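The authors propose a two-stage pipeline that decouples video understanding from robot imitation. It draws on Temporal Shift Modules (TSM) and vision-language models (VLMs) for the visual side and on TD3-based reinforcement learning for the robot side, with the aim of improving learning efficiency and generalization in robot skill acquisition.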
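The decoupling described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `ManipulationCommand` hand-off format, the function names, and the division of labor (an action classifier producing a label, a separate object identifier naming the target, and a policy consuming only the resulting command) are all assumptions made for illustration. The stub components stand in for trained models such as a TSM classifier, a VLM, or a TD3 actor.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class ManipulationCommand:
    """Hypothetical hand-off format between the two stages."""
    action: str         # action label, e.g. from a video action classifier
    target_object: str  # interacted object, e.g. named by a vision-language model


# Hypothetical stage interfaces: any callables with these shapes fit.
ActionClassifier = Callable[[Sequence[object]], str]
ObjectIdentifier = Callable[[Sequence[object], str], str]
Policy = Callable[[ManipulationCommand, List[float]], List[float]]


def understand_video(frames: Sequence[object],
                     classify_action: ActionClassifier,
                     identify_object: ObjectIdentifier) -> ManipulationCommand:
    """Stage 1: turn raw frames into a symbolic command."""
    action = classify_action(frames)
    target = identify_object(frames, action)
    return ManipulationCommand(action, target)


def imitate(command: ManipulationCommand,
            policy: Policy,
            observation: List[float]) -> List[float]:
    """Stage 2: the policy sees the command and robot state, never the video."""
    return policy(command, observation)


# Stub components standing in for trained models (illustrative only).
def stub_classifier(frames):
    return "pick"


def stub_identifier(frames, action):
    return "cube"


def stub_policy(command, observation):
    # Returns a zero action with one entry per observation dimension.
    return [0.0 for _ in observation]


command = understand_video(["frame0", "frame1"], stub_classifier, stub_identifier)
motor_action = imitate(command, stub_policy, [0.1, 0.2, 0.3])
```

Because the two stages communicate only through the command object, either side could in principle be retrained or swapped without paired video-and-trajectory data, which is the motivation the summary gives for the modular design.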
Findings
Achieved 89.97% action classification accuracy in video understanding.
Reached 87.5% success rate in robot manipulation tasks.
Significant improvements over baseline methods in accuracy and generalization.
Abstract
Learning from Demonstration (LfD) offers a promising paradigm for robot skill acquisition. Recent approaches attempt to extract manipulation commands directly from video demonstrations, yet face two critical challenges: (1) general video captioning models prioritize global scene features over task-relevant objects, producing descriptions unsuitable for precise robotic execution, and (2) end-to-end architectures coupling visual understanding with policy learning require extensive paired datasets and struggle to generalize across objects and scenarios. To address these limitations, we propose a novel "Human-to-Robot" imitation learning pipeline that enables robots to acquire manipulation skills directly from unstructured video demonstrations, inspired by the human ability to learn by watching and imitating. Our key innovation is a modular framework that decouples the learning process…
Taxonomy
Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics
