From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

Yihan Lin; Haoyang Li; Yang Li; Haitao Shen; Yihan Zhao; Chao Shao; Jing Zhang

arXiv:2605.04678·cs.RO·May 7, 2026

TL;DR

This paper systematically compares latent action supervision strategies for vision-language-action (VLA) models. Image-based latent actions help long-horizon reasoning and scene-level generalization, action-based latent actions help complex motor coordination, and supervising with discrete latent action tokens performs best.

Contribution

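The paper organizes latent action supervision into two perspectives (regularizing the trajectory with image-based latent actions, and unifying the target space with action-based latent actions), compares four representative integration strategies under a unified VLA baseline, and identifies which kinds of tasks each formulation suits.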

Findings

01 Image-based latent actions improve long-horizon reasoning.

02 Action-based latent actions enhance complex motor coordination.

03 Supervising with discrete latent action tokens yields the best performance.

Abstract

Latent actions serve as an intermediate representation that enables consistent modeling of vision-language-action (VLA) models across heterogeneous datasets. However, approaches to supervising VLAs with latent actions are fragmented and lack a systematic comparison. This work structures the study of latent action supervision from two perspectives: (i) regularizing the trajectory via image-based latent actions, and (ii) unifying the target space with action-based latent actions. Under a unified VLA baseline, we instantiate and compare four representative integration strategies. Our results reveal a formulation-task correspondence: image-based latent actions benefit long-horizon reasoning and scene-level generalization, whereas action-based latent actions excel at complex motor coordination. Furthermore, we find that directly supervising the VLM with discrete latent action tokens yields…
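The abstract excerpt does not say how latent actions are discretized, so the following is only a generic illustration of what "discrete latent action tokens" could look like: a nearest-neighbor vector-quantization lookup that maps each continuous latent action to a codebook index, then offsets that index into a token-id range. The codebook values, the `token_offset` of 32000, and the function names are all hypothetical and are not taken from the paper or its repository.

```python
from typing import List, Sequence


def quantize(latent: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to `latent` (squared L2)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(latent, codebook[i]))


def tokens_for_trajectory(
    latents: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
    token_offset: int = 0,
) -> List[int]:
    """Map a sequence of continuous latent actions to discrete token ids.

    `token_offset` is an assumed convention: it shifts codebook indices past
    an existing text vocabulary so the ids could serve as classification targets.
    """
    return [token_offset + quantize(z, codebook) for z in latents]


# Hypothetical 2-D codebook with four entries (illustrative values only).
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Hypothetical continuous latent actions for three consecutive steps.
latents = [(0.1, -0.2), (0.9, 0.8), (0.2, 0.7)]

tokens = tokens_for_trajectory(latents, CODEBOOK, token_offset=32000)
print(tokens)  # [32000, 32003, 32002]
```

In a setup like this, the resulting ids would be ordinary classification targets for a token-level loss. Whether the paper uses this kind of quantizer or loss is not stated in the excerpt above.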

Code & Models

Repositories

RUCKBReasoning/From_Pixels_to_Tokens (GitHub)