DMotion: Robotic Visuomotor Control with Unsupervised Forward Model Learned from Videos
Haoqi Yuan, Ruihai Wu, Andrew Zhao, Haipeng Zhang, Zihan Ding, Hao Dong

TL;DR
DMotion introduces an unsupervised video-based approach for robotic visuomotor control by learning a forward model that disentangles controllable agent motion, enabling effective model predictive control without labeled data.
Contribution
The paper presents DMotion, a method that learns an environment forward model solely from videos. It is trained end to end, disentangles the motion of the controllable agent, and represents that motion with physically interpretable transformations.
Findings
Achieves superior forward model accuracy in Grid World and robotic simulation environments.
Demonstrates effective robotic manipulation using learned models in model predictive control.
Operates without requiring labeled actions or object annotations.
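How a forward model is used inside model predictive control can be sketched roughly as follows. This is a toy illustration, not the paper's implementation: `forward_model`, `ACTIONS`, `plan`, and `run_mpc` are hypothetical names, the state is a 2D grid position rather than an image, and the planner is a plain exhaustive short-horizon search chosen for simplicity.

```python
import itertools

# Hypothetical discrete agent motions (dx, dy); the last one means "stay".
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]


def forward_model(state, action):
    """Toy stand-in for a learned forward model: shift the agent position."""
    return (state[0] + action[0], state[1] + action[1])


def distance(state, goal):
    """Manhattan distance between a state and the goal."""
    return abs(state[0] - goal[0]) + abs(state[1] - goal[1])


def plan(state, goal, horizon=3):
    """Return the first action of the lowest-cost action sequence.

    Every sequence of length `horizon` is rolled out through the forward
    model, and the cost is the distance to the goal summed over the rollout.
    """
    best_cost, best_first = None, None
    for seq in itertools.product(ACTIONS, repeat=horizon):
        s, cost = state, 0
        for action in seq:
            s = forward_model(s, action)
            cost += distance(s, goal)
        if best_cost is None or cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first


def run_mpc(start, goal, max_steps=20):
    """Replan at every step and execute only the first planned action."""
    state = start
    for _ in range(max_steps):
        if state == goal:
            break
        state = forward_model(state, plan(state, goal))
    return state
```

Calling `run_mpc((0, 0), (3, 2))` drives the toy agent to the goal cell. Summing the cost over the whole rollout, rather than scoring only the final state, makes the planner prefer sequences that reach the goal early and then stay there.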
Abstract
Learning an accurate model of the environment is essential for model-based control tasks. Existing methods in robotic visuomotor control usually learn from data with heavily labelled actions, object entities or locations, which can be demanding in many cases. To cope with this limitation, we propose a method, dubbed DMotion, that trains a forward model from video data only, via disentangling the motion of the controllable agent to model the transition dynamics. An object extractor and an interaction learner are trained in an end-to-end manner without supervision. The agent's motions are explicitly represented using spatial transformation matrices containing physical meanings. In the experiments, DMotion achieves superior performance on learning an accurate forward model in a Grid World environment, as well as a more realistic robot control environment in simulation. With the accurate…
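To make the idea of a motion represented as a spatial transformation matrix concrete, here is a minimal sketch that encodes a 2D translation as a 3x3 matrix in homogeneous coordinates. The exact parameterization DMotion uses is not given in this summary, so the helper names `translation`, `matmul`, and `apply` are assumptions made purely for illustration.

```python
def translation(dx, dy):
    """3x3 homogeneous matrix that shifts a 2D point by (dx, dy)."""
    return [[1, 0, dx],
            [0, 1, dy],
            [0, 0, 1]]


def matmul(a, b):
    """Product of two 3x3 matrices; composes two transformations."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply(matrix, point):
    """Apply a 3x3 homogeneous transformation to a 2D point."""
    vec = (point[0], point[1], 1)
    out = [sum(matrix[i][k] * vec[k] for k in range(3)) for i in range(3)]
    return (out[0], out[1])


# Two consecutive motions compose into a single matrix.
move_right = translation(1, 0)
move_up_twice = translation(0, 2)
combined = matmul(move_right, move_up_twice)
new_position = apply(combined, (3, 4))  # (4, 6)
```

A representation like this is interpretable in the sense that the entries of the matrix can be read directly as a displacement, and a sequence of motions reduces to a matrix product.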
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Multimodal Machine Learning Applications
