FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via Neural Action Tokenization

Yicheng Liu; Shiduo Zhang; Zibin Dong; Baijun Ye; Tianyuan Yuan; Xiaopeng Yu; Linqi Yin; Chenhao Lu; Junhao Shi; Luca Jiang-Tao Yu; Liangtao Zheng; Tao Jiang; Jingjing Gong; Xipeng Qiu; Hang Zhao

arXiv:2512.04952 · cs.CV · December 9, 2025

TL;DR

FASTer introduces a neural action tokenizer and an efficient autoregressive framework for vision-language-action models, significantly improving inference speed and task performance in robotic manipulation.

Contribution

The paper presents FASTer, a novel unified framework combining a learnable tokenizer with autoregressive policies, enhancing efficiency and generalization in robot learning tasks.

Findings

01. FASTerVQ achieves high-quality action chunk encoding with high compression.

02. FASTerVLA outperforms previous models in inference speed and task success rates.

03. Extensive experiments validate FASTer's superior generalization and efficiency.

Abstract

Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce FASTer, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer with an autoregressive policy built upon it. FASTerVQ encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and…
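The tokenization idea in the abstract (an action chunk treated as a single-channel image and mapped to a short sequence of discrete tokens) can be illustrated with a minimal sketch. This is not the authors' FASTerVQ: the learned neural encoder and decoder are swapped for a fixed random codebook with nearest-neighbour lookup over non-overlapping patches, a layout the abstract does not specify. The names `patchify`, `quantize`, and `decode`, and the sizes `T`, `D`, `ph`, `pw`, `K`, are hypothetical and chosen only for illustration.

```python
import random


def patchify(chunk, ph, pw):
    """Split a T x D action chunk (viewed as a 1-channel image) into flat patches."""
    T, D = len(chunk), len(chunk[0])
    assert T % ph == 0 and D % pw == 0, "patch size must divide the chunk shape"
    patches = []
    for r in range(0, T, ph):
        for c in range(0, D, pw):
            patches.append([chunk[r + i][c + j] for i in range(ph) for j in range(pw)])
    return patches


def quantize(patches, codebook):
    """Nearest-neighbour lookup: each patch becomes the index of its closest code."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sqdist(p, codebook[k]))
            for p in patches]


def decode(tokens, codebook, T, D, ph, pw):
    """Rebuild a T x D chunk by pasting each code vector back into its patch slot."""
    chunk = [[0.0] * D for _ in range(T)]
    cols = D // pw  # number of patches per row of the "image"
    for n, tok in enumerate(tokens):
        r0, c0 = (n // cols) * ph, (n % cols) * pw
        for i in range(ph):
            for j in range(pw):
                chunk[r0 + i][c0 + j] = codebook[tok][i * pw + j]
    return chunk


# Illustrative sizes only (assumptions, not values from the paper):
# 16 timesteps x 8 action dimensions, 4 x 4 patches, 32 codebook entries.
rng = random.Random(0)
T, D, ph, pw, K = 16, 8, 4, 4, 32
chunk = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
codebook = [[rng.uniform(-1, 1) for _ in range(ph * pw)] for _ in range(K)]

tokens = quantize(patchify(chunk, ph, pw), codebook)
recon = decode(tokens, codebook, T, D, ph, pw)
compression = (T * D) / len(tokens)  # scalar action values per emitted token
```

With these sizes the 128 scalar values of the chunk are represented by 8 tokens, so an autoregressive policy would emit 8 tokens per chunk instead of 128 per-value tokens. A random codebook reconstructs poorly; a learned tokenizer would instead train the codebook and the surrounding networks to minimise reconstruction error, which is where the fidelity versus efficiency trade-off described in the abstract arises.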

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Domain Adaptation and Few-Shot Learning