Human-to-Robot Interaction: Learning from Video Demonstration for Robot Imitation

Thanh Nguyen Canh; Thanh-Tuan Tran; Haolan Zhang; Ziyan Gao; Nak Young Chong; Xiem HoangVan

arXiv:2602.19184 · cs.RO · February 24, 2026

Open Access

TL;DR

This paper introduces a modular imitation learning framework enabling robots to learn manipulation skills directly from unstructured video demonstrations, combining visual understanding with reinforcement learning for improved generalization and accuracy.

Contribution

The authors propose a two-stage pipeline that decouples video understanding from robot imitation. It combines a Temporal Shift Module (TSM), vision-language models (VLMs), and TD3-based reinforcement learning (RL), with the aim of improving learning efficiency and generalization in robot skill acquisition.
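
To make the idea of decoupling concrete, here is a minimal Python sketch of a two-stage hand-off. It is an illustration only, not the authors' implementation: the `Command` schema, the stub functions, and the skill names are all hypothetical. The stubs stand in for the TSM/VLM stage and the TD3-based policy, which this page does not describe in detail.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Command:
    """Symbolic hand-off between the two stages (hypothetical schema)."""
    action: str
    target: str


# Stage 1 maps video frames to a command; stage 2 maps a command to a success flag.
VideoUnderstanding = Callable[[Sequence[str]], Command]
RobotImitation = Callable[[Command], bool]


def stub_video_understanding(frames: Sequence[str]) -> Command:
    """Stand-in for the video understanding stage; returns a fixed command."""
    if not frames:
        raise ValueError("need at least one frame")
    return Command(action="pick", target="cube")


def stub_robot_imitation(command: Command) -> bool:
    """Stand-in for the learned policy; 'succeeds' only on skills it knows."""
    known_skills = {"pick", "place", "push"}  # hypothetical skill set
    return command.action in known_skills


def run_pipeline(
    frames: Sequence[str],
    understand: VideoUnderstanding,
    imitate: RobotImitation,
) -> Tuple[Command, bool]:
    """Run stage 1, then stage 2. The Command is the only coupling between them."""
    command = understand(frames)
    return command, imitate(command)


if __name__ == "__main__":
    cmd, ok = run_pipeline(
        ["frame_0", "frame_1"], stub_video_understanding, stub_robot_imitation
    )
    print(cmd, ok)
```

Because the two stages share only the `Command` interface in this sketch, either one could be retrained or replaced without touching the other, which is the practical appeal of a modular design over an end-to-end one.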

Findings

1. Achieved 89.97% action classification accuracy in video understanding.

2. Reached 87.5% success rate in robot manipulation tasks.

3. Significant improvements over baseline methods in accuracy and generalization.

Abstract

Learning from Demonstration (LfD) offers a promising paradigm for robot skill acquisition. Recent approaches attempt to extract manipulation commands directly from video demonstrations, yet face two critical challenges: (1) general video captioning models prioritize global scene features over task-relevant objects, producing descriptions unsuitable for precise robotic execution, and (2) end-to-end architectures coupling visual understanding with policy learning require extensive paired datasets and struggle to generalize across objects and scenarios. To address these limitations, we propose a novel "Human-to-Robot" imitation learning pipeline that enables robots to acquire manipulation skills directly from unstructured video demonstrations, inspired by the human ability to learn by watching and imitating. Our key innovation is a modular framework that decouples the learning process…

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics