Pixel Motion as Universal Representation for Robot Control
Kanchana Ranasinghe, Xiang Li, E-Ro Nguyen, Cristina Mata, Jongwoo Park, Michael S Ryoo

TL;DR
LangToMo introduces a hierarchical framework that uses pixel motion forecasts as an intermediate, interpretable representation to enable flexible, scalable, and generalizable robot control guided by vision, language, and motion data.
Contribution
It presents a novel dual-system architecture that leverages pixel motion forecasts as universal representations for robot control, trained on weakly-supervised video-caption data.
Findings
Effective translation of pixel motion into robot actions.
Flexible control under supervised and unsupervised settings.
Scalable approach bridging language, motion, and action.
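The first finding, translating pixel motion into robot actions, can be illustrated with a hand-crafted mapping. The following is a minimal sketch and not the authors' implementation: it assumes a dense motion field of per-pixel `(dx, dy)` displacements, a hypothetical boolean `mask` marking gripper pixels, a fixed `metres_per_pixel` scale, and a camera whose image plane is aligned with the plane the robot moves in. None of these names or values come from the paper.

```python
import numpy as np


def motion_to_action(flow, mask, metres_per_pixel=0.001, max_step=0.05):
    """Map a dense pixel-motion field to a planar end-effector displacement.

    flow: (H, W, 2) array of per-pixel (dx, dy) motion, in pixels.
    mask: (H, W) boolean array marking gripper pixels (hypothetical input).
    metres_per_pixel, max_step: illustrative constants, not from the paper.
    """
    flow = np.asarray(flow, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if flow.ndim != 3 or flow.shape[2] != 2 or flow.shape[:2] != mask.shape:
        raise ValueError("flow must be (H, W, 2) and mask must be (H, W)")
    if not mask.any():
        # No gripper pixels visible: command no motion.
        return np.zeros(2)

    # Average the forecast motion over the gripper region.
    mean_px = flow[mask].mean(axis=0)

    # Convert pixels to metres under the assumed fixed scale.
    action = mean_px * metres_per_pixel

    # Clip the step length so a noisy forecast cannot command a large jump.
    norm = np.linalg.norm(action)
    if norm > max_step:
        action = action * (max_step / norm)
    return action
```

A learned mapping would replace the averaging and scaling with a small network trained on a limited amount of action-labelled data; the interface (motion field in, action out) would stay the same.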
Abstract
We present LangToMo, a vision-language-action framework structured as a dual-system architecture that uses pixel motion forecasts as intermediate representations. Our high-level System 2, an image diffusion model, generates text-conditioned pixel motion sequences from a single frame to guide robot control. Pixel motion, a universal, interpretable, and motion-centric representation, can be extracted from videos in a weakly-supervised manner, enabling diffusion model training on any video-caption data. Treating generated pixel motion as learned universal representations, our low-level System 1 module translates these into robot actions via motion-to-action mapping functions, which can be either hand-crafted or learned with minimal supervision. System 2 operates as a high-level policy applied at sparse temporal intervals, while System 1 acts as a low-level policy at dense temporal intervals.…
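The split between sparse System 2 calls and dense System 1 calls described in the abstract can be sketched as a two-rate control loop. This is an illustrative sketch under stated assumptions, not the authors' code: the callables `observe`, `system2`, `system1`, and `act`, and the `replan_every` interval, are hypothetical names introduced here. The abstract specifies only that System 2 runs at sparse intervals and System 1 at dense ones.

```python
def run_dual_system(observe, system2, system1, act, instruction,
                    num_steps, replan_every=5):
    """Run a two-rate control loop and return the steps where System 2 ran.

    observe():               returns the current camera frame.
    system2(frame, text):    returns a pixel-motion forecast (slow, sparse).
    system1(frame, motion):  returns a robot action (fast, dense).
    act(action):             sends the action to the robot.
    All four callables and replan_every are hypothetical placeholders.
    """
    if replan_every < 1:
        raise ValueError("replan_every must be at least 1")

    replan_steps = []
    motion = None
    for step in range(num_steps):
        frame = observe()

        # High-level policy: refresh the motion forecast only occasionally.
        if step % replan_every == 0:
            motion = system2(frame, instruction)
            replan_steps.append(step)

        # Low-level policy: map the latest forecast to an action every step.
        act(system1(frame, motion))
    return replan_steps
```

With `num_steps=6` and `replan_every=3`, System 2 is queried at steps 0 and 3 while System 1 issues an action at every step. This amortises the cost of the slower forecasting model over several control steps.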
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper identifies a key bottleneck in robot learning from videos: the need for action supervision and embodiment-specific data. The idea of treating pixel motion as a universal, interpretable, and embodiment-agnostic abstraction is good.
2. The dual-system design also addresses the real-time requirements of a robot policy

1. Although the idea of using pixel motion as an action representation is nice, I feel it has been widely studied in previous work. The authors list the differences from previous works in Table 1. I feel the idea is somewhat incremental.
2. For the simulation experiments, the authors only ran experiments on 11 Metaworld benchmark tasks, which is limited. Many previous works train language-conditioned policies on the whole Metaworld benchmark. Also, Metaworld is not designed for language-conditioned
The writing is clear and easy to follow. The method is able to leverage the unlabeled human data and enable scaled learning.
The novelty is limited. Many papers have explored the idea of extracting universal action representation from videos. The performance is only evaluated on MetaWorld and the performance gain is marginal compared to ATM. More evaluations are needed. See questions below.
- The proposed two-stage framework preserves the original model’s capabilities while enabling the transformation from vision-language signals to action representations.
- Compared to related work, LangToMo employs a diffusion model to directly predict pixel motion instead of generating full video sequences.
- Surpasses other baseline methods in real-world zero-shot tasks via large-scale pretraining.

- **Longer inference latency** Similar to UniPi, many denoising steps are required when the diffusion model predicts pixels or PMs, which leads to long inference delays and false closed-loop control, and limits the model to static scenes.
- **Weak evaluation** Choosing the Metaworld benchmark for the main-text experiments on VLA models is less convincing. Metaworld tasks and scenarios are relatively simple, and accurate action prediction can be achieved using images alone without requiring
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis
Methods: Diffusion
