A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

Yifan Zhong; Fengshuo Bai; Shaofei Cai; Xuchuan Huang; Zhang Chen; Xiaowei Zhang; Yuanfei Wang; Shaoyang Guo; Tianrui Guan; Ka Nam Lui; Zhiquan Qi; Yitao Liang; Yuanpei Chen; Yaodong Yang

arXiv:2507.01925·cs.RO·July 3, 2025

Open Access

TL;DR

This survey reviews vision-language-action models, focusing on how they use various types of action tokens to ground and generate executable actions, aiming to unify understanding and guide future research in this evolving field.

Contribution

It categorizes and interprets existing VLA models through action tokenization, providing a comprehensive framework and identifying promising directions for future development.

Findings

01 Unified framework for VLA models via action tokens

02 Categorization of action token types and their strengths/limitations

03 Guidance for future research in VLA model development

Abstract

The remarkable advancements of vision and language foundation models in multimodal understanding, reasoning, and generation have sparked growing efforts to extend such intelligence to the physical world, fueling the flourishing of vision-language-action (VLA) models. Despite seemingly diverse approaches, we observe that current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of action tokens that progressively encode more grounded and actionable information, ultimately generating executable actions. We further determine that the primary design choice distinguishing VLA models lies in how action tokens are formulated, which can be categorized into language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning. However, there remains a lack…
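The framework described in the abstract can be pictured as a chain of modules, each emitting one action token that later modules can build on. The following is a minimal illustrative sketch, not the authors' implementation: only the eight category names come from the abstract, while `ActionToken`, `run_chain`, `toy_planner`, and `toy_policy` are hypothetical names, and the 7-dimensional zero vector is an assumed placeholder for a low-level command.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, List


class TokenType(Enum):
    """The eight action-token categories listed in the abstract."""
    LANGUAGE_DESCRIPTION = "language description"
    CODE = "code"
    AFFORDANCE = "affordance"
    TRAJECTORY = "trajectory"
    GOAL_STATE = "goal state"
    LATENT_REPRESENTATION = "latent representation"
    RAW_ACTION = "raw action"
    REASONING = "reasoning"


@dataclass
class ActionToken:
    kind: TokenType
    payload: Any  # free-form content; its structure is an assumption here


# Hypothetical module signature: (observation, instruction, tokens so far) -> next token.
VLAModule = Callable[[Any, str, List[ActionToken]], ActionToken]


def run_chain(observation: Any, instruction: str,
              modules: List[VLAModule]) -> List[ActionToken]:
    """Apply each module in order, passing it the tokens produced so far."""
    tokens: List[ActionToken] = []
    for module in modules:
        tokens.append(module(observation, instruction, tokens))
    return tokens


def toy_planner(observation, instruction, tokens):
    # Stand-in for a high-level module that emits a language description.
    return ActionToken(TokenType.LANGUAGE_DESCRIPTION, f"subtask for: {instruction}")


def toy_policy(observation, instruction, tokens):
    # Stand-in for a low-level module; the zero vector is a placeholder command.
    return ActionToken(TokenType.RAW_ACTION, [0.0] * 7)


chain = run_chain(None, "pick up the cup", [toy_planner, toy_policy])
print([token.kind.value for token in chain])
```

In this sketch the last token in the chain is the executable one; a real system could place any of the other token types between the two stages shown.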

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Ethics and Social Impacts of AI