Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
Yi Chen, Yuying Ge, Weiliang Tang, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, Xihui Liu

TL;DR
This paper introduces Moto, which learns a "language" of motion from videos by encoding them as latent motion tokens and then transfers that motion knowledge to real robot actions. The authors report improved robustness and efficiency on robot manipulation benchmarks.
Contribution
We propose Moto, a novel approach that converts videos into latent motion tokens and pre-trains a GPT model for robotic manipulation, enabling effective transfer of motion knowledge from videos to robots.
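To make the idea of "videos as motion tokens" concrete, here is a minimal toy sketch in Python. It is not the authors' tokenizer, which this summary does not describe. As a stand-in it assumes each frame is already a small feature vector, takes the difference between consecutive frames, and maps that difference to the nearest entry of a fixed, hand-written codebook. The names `motion_tokens`, `quantize`, `next_token_pairs`, and the codebook values are all hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry nearest to `vec`."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vec, codebook[i]))


def motion_tokens(frame_features: Sequence[Vector],
                  codebook: Sequence[Vector]) -> List[int]:
    """Turn a sequence of per-frame features into discrete motion tokens.

    Assumption: motion between two frames is approximated by the feature
    difference (a toy choice, not taken from the paper).
    """
    tokens = []
    for prev, curr in zip(frame_features, frame_features[1:]):
        delta = [c - p for p, c in zip(prev, curr)]
        tokens.append(quantize(delta, codebook))
    return tokens


def next_token_pairs(tokens: List[int]) -> List[Tuple[List[int], int]]:
    """Build (context, target) pairs for autoregressive next-token training."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Hypothetical 2-D codebook: 0 = "still", 1 = "move right", 2 = "move up".
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (1.0, 1.0)]

tokens = motion_tokens(frames, codebook)
pairs = next_token_pairs(tokens)
print(tokens)  # [1, 2, 0]
print(pairs)   # [([1], 2), ([1, 2], 0)]
```

The (context, target) pairs are the kind of data a GPT-style model would consume during next-token pre-training. In a real system the codebook and the frame features would be learned rather than hand-written.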
Findings
Moto-GPT can generate interpretable motion tokens.
It predicts plausible motion trajectories.
It shows improved robustness and efficiency on benchmarks.
Abstract
Recent developments in Large Language Models pre-trained on extensive corpora have shown significant success in various natural language processing tasks with minimal fine-tuning. This success offers new promise for robotics, which has long been constrained by the high cost of action-labeled data. We ask: given the abundant video data containing interaction-related knowledge available as a rich "corpus", can a similar generative pre-training approach be effectively applied to enhance robot learning? The key challenge is to identify an effective representation for autoregressive pre-training that benefits robot manipulation tasks. Inspired by the way humans learn new skills through observing dynamic environments, we propose that effective robotic learning should emphasize motion-related knowledge, which is closely tied to low-level actions and is hardware-agnostic, facilitating the…
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation · Robotics and Automated Systems
