Pixel Motion as Universal Representation for Robot Control

Kanchana Ranasinghe; Xiang Li; E-Ro Nguyen; Cristina Mata; Jongwoo Park; Michael S Ryoo

arXiv:2505.07817·cs.RO·August 29, 2025

Open Access 3 Reviews

TL;DR

LangToMo introduces a hierarchical framework that uses pixel motion forecasts as an intermediate, interpretable representation to enable flexible, scalable, and generalizable robot control guided by vision, language, and motion data.

Contribution

It presents a novel dual-system architecture that leverages pixel motion forecasts as universal representations for robot control, trained on weakly-supervised video-caption data.

Findings

01. Effective translation of pixel motion into robot actions.

02. Flexible control under supervised and unsupervised settings.

03. Scalable approach bridging language, motion, and action.
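The first finding concerns translating pixel motion into robot actions. As a rough illustration only, the sketch below shows one way a hand-crafted mapping could look: it averages a dense motion field in a small window around the gripper's image location and scales the result to meters. The function name `flow_to_planar_action`, the `(H, W, 2)` array layout with channels `(dx, dy)`, the window size, and the `meters_per_pixel` constant are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def flow_to_planar_action(flow, gripper_px, window=5, meters_per_pixel=0.002):
    """Average the motion field around the gripper and scale it to meters.

    flow:        (H, W, 2) array of per-pixel (dx, dy) displacements (assumed layout).
    gripper_px:  (row, col) image location of the gripper (assumed to be known).
    """
    h, w, _ = flow.shape
    row, col = gripper_px
    half = window // 2
    # Clip the window to the image bounds.
    r0, r1 = max(0, row - half), min(h, row + half + 1)
    c0, c1 = max(0, col - half), min(w, col + half + 1)
    patch = flow[r0:r1, c0:c1].reshape(-1, 2)
    # Mean pixel displacement, converted with a hypothetical calibration constant.
    return patch.mean(axis=0) * meters_per_pixel

# Toy motion field: a square region moves 3 px right and 1 px up per step.
flow = np.zeros((64, 64, 2), dtype=np.float32)
flow[20:40, 20:40] = [3.0, -1.0]

action = flow_to_planar_action(flow, gripper_px=(30, 30))
idle = flow_to_planar_action(flow, gripper_px=(5, 5))
```

Here `action` is a two-element planar displacement for a gripper inside the moving region, and `idle` is zero because no motion is forecast around that location. A learned mapping would replace the averaging step with a small trained model.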

Abstract

We present LangToMo, a vision-language-action framework structured as a dual-system architecture that uses pixel motion forecasts as intermediate representations. Our high-level System 2, an image diffusion model, generates text-conditioned pixel motion sequences from a single frame to guide robot control. Pixel motion, a universal, interpretable, and motion-centric representation, can be extracted from videos in a weakly-supervised manner, enabling diffusion model training on any video-caption data. Treating generated pixel motion as learned universal representations, our low-level System 1 module translates these into robot actions via motion-to-action mapping functions, which can be either hand-crafted or learned with minimal supervision. System 2 operates as a high-level policy applied at sparse temporal intervals, while System 1 acts as a low-level policy at dense temporal intervals.…
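The sparse/dense split described in the abstract can be sketched as a simple control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_dual_system`, the `replan_every` parameter, and the toy stand-ins for the diffusion model, the mapping function, and the camera are all hypothetical names introduced here.

```python
import numpy as np

def run_dual_system(forecast_motion, motion_to_action, get_frame,
                    total_steps, replan_every):
    """Call the slow forecaster sparsely and the fast mapper at every step."""
    actions = []
    motion = None
    for t in range(total_steps):
        frame = get_frame(t)
        if t % replan_every == 0:
            # System 2 stand-in: refresh the pixel motion forecast.
            motion = forecast_motion(frame)
        # System 1 stand-in: map the latest forecast to an action.
        actions.append(motion_to_action(motion, frame))
    return actions

# Toy stand-ins (assumptions for illustration, not the paper's models).
system2_calls = []

def toy_forecast(frame):
    system2_calls.append(1)
    h, w = frame.shape[:2]
    flow = np.zeros((h, w, 2), dtype=np.float32)
    flow[..., 0] = 1.0  # every pixel moves one pixel to the right
    return flow

def toy_mapper(motion, frame):
    # Mean (dx, dy) over the whole image.
    return motion.reshape(-1, 2).mean(axis=0)

def toy_camera(t):
    return np.zeros((8, 8, 3), dtype=np.uint8)

actions = run_dual_system(toy_forecast, toy_mapper, toy_camera,
                          total_steps=10, replan_every=4)
print(len(actions), len(system2_calls))  # prints "10 3"
```

With ten control steps and a replanning interval of four, the expensive forecaster runs three times (at steps 0, 4, and 8) while an action is produced at every step, which is the kind of amortization the dual-system design aims for.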

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The paper identifies a key bottleneck in robot learning from videos: the need for action supervision and embodiment-specific data. The idea of treating pixel motion as a universal, interpretable, and embodiment-agnostic abstraction is good.
2. The dual-system design also addresses the real-time issue of robot policy

Weaknesses

1. Although using pixel motion as an action representation is a nice idea, it has been widely studied in previous work. The authors list the differences from previous work in Table 1. The idea feels somewhat incremental.
2. For the simulation experiments, the authors only ran experiments on 11 Metaworld benchmark tasks, which is limited. Many previous works train language-conditioned policies on the whole Metaworld benchmark. Also, Metaworld is not designed for language-conditioned

Reviewer 02 · Rating 2 · Confidence 3

Strengths

The writing is clear and easy to follow. The method is able to leverage the unlabeled human data and enable scaled learning.

Weaknesses

The novelty is limited. Many papers have explored the idea of extracting universal action representations from videos. Performance is evaluated only on MetaWorld, and the gain over ATM is marginal. More evaluations are needed. See questions below.

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- The proposed two-stage framework preserves the original model’s capabilities while enabling the transformation from vision-language signals to action representations.
- Compared to related work, LangToMo employs a diffusion model to directly predict pixel motion instead of generating full video sequences.
- Surpasses other baseline methods in real-world zero-shot tasks via large-scale pretraining.

Weaknesses

- **Longer inference latency** Similar to UniPi, many denoising steps are required when the diffusion model predicts pixels or PMs, which leads to long inference delays and false closed-loop control, which limits the model to static scenes.
- **Weak evaluation** Choosing the Metaworld benchmark for the main-text experiments on VLA models is less convincing. Metaworld tasks and scenarios are relatively simple, and accurate action prediction can be achieved using images alone without requiring

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis

Methods: Diffusion