Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos

Yi Chen; Yuying Ge; Weiliang Tang; Yizhuo Li; Yixiao Ge; Mingyu Ding; Ying Shan; Xihui Liu

arXiv:2412.04445 · cs.RO · October 17, 2025

TL;DR

This paper introduces Moto, which learns a "language" of motion from videos in the form of latent motion tokens and transfers that motion knowledge to real robot actions. The authors report improved robustness and efficiency on robot manipulation benchmarks.

Contribution

We propose Moto, a novel approach that converts videos into latent motion tokens and pre-trains a GPT model for robotic manipulation, enabling effective transfer of motion knowledge from videos to robots.
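As a rough illustration of the idea above (a minimal sketch, not the authors' implementation), the snippet below maps each motion feature to its nearest entry in a small codebook, producing a sequence of discrete token ids, and then forms the (context, next token) pairs that a GPT-style autoregressive model would be trained on. The codebook, the feature values, and all function names are hypothetical; a real system would learn such representations from video frames rather than hard-code them.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(feature: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to `feature` (squared L2)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))


def tokenize_motion(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Quantize each motion feature into a discrete token id."""
    return [nearest_code(f, codebook) for f in features]


def next_token_pairs(tokens: List[int]) -> List[Tuple[List[int], int]]:
    """Build (context, target) pairs for autoregressive next-token prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Hypothetical 2-D codebook: "no motion", "move right", "move up".
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Hypothetical motion features, one per pair of consecutive video frames.
FEATURES = [(0.1, -0.1), (0.9, 0.2), (0.8, 0.1), (0.1, 1.2)]

tokens = tokenize_motion(FEATURES, CODEBOOK)
pairs = next_token_pairs(tokens)

print(tokens)  # [0, 1, 1, 2]
for context, target in pairs:
    print(context, "->", target)
```

Because the token ids depend only on the motion features and not on any robot's action space, a sequence like this can in principle be built from unlabeled video, which is the property the pre-training approach relies on.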

Findings

01 Moto-GPT can generate interpretable motion tokens.

02 It predicts plausible motion trajectories.

03 It shows improved robustness and efficiency on benchmarks.

Abstract

Recent developments in Large Language Models pre-trained on extensive corpora have shown significant success in various natural language processing tasks with minimal fine-tuning. This success offers new promise for robotics, which has long been constrained by the high cost of action-labeled data. We ask: given the abundant video data containing interaction-related knowledge available as a rich "corpus", can a similar generative pre-training approach be effectively applied to enhance robot learning? The key challenge is to identify an effective representation for autoregressive pre-training that benefits robot manipulation tasks. Inspired by the way humans learn new skills through observing dynamic environments, we propose that effective robotic learning should emphasize motion-related knowledge, which is closely tied to low-level actions and is hardware-agnostic, facilitating the…

Code & Models

Repositories

tencentarc/moto (PyTorch, official)

Models

TencentARC/Moto (Hugging Face model; 6 downloads, 7 likes)

Taxonomy

Topics: Human Pose and Action Recognition · Human Motion and Animation · Robotics and Automated Systems