ActionCodec: What Makes for Good Action Tokenizers

Zibin Dong; Yicheng Liu; Shiduo Zhang; Baijun Ye; Yifu Yuan; Fei Ni; Jingjing Gong; Xipeng Qiu; Hang Zhao; Yinchuan Li; Jianye Hao

arXiv:2602.15397·cs.RO·February 18, 2026

Open Access

TL;DR

This paper establishes information-theoretic design principles for action tokenizers in vision-language-action (VLA) models and introduces ActionCodec, a tokenizer built on those principles. ActionCodec improves training efficiency and task performance, reaching a 95.5% success rate on LIBERO without robotics pre-training.

Contribution

The paper derives design principles for action tokenizers from information-theoretic analysis of VLA optimization and, guided by them, develops ActionCodec, a high-performance action tokenizer for VLA models.

Findings

01 Achieves a 95.5% success rate on LIBERO without robotics pre-training

02 Reaches a 97.4% success rate with architectural enhancements, setting a new state of the art

03 Improves training efficiency and VLA performance across benchmarks

Abstract

Vision-Language-Action (VLA) models leveraging the native autoregressive paradigm of Vision-Language Models (VLMs) have demonstrated superior instruction-following and training efficiency. Central to this paradigm is action tokenization, yet its design has primarily focused on reconstruction fidelity, failing to address its direct impact on VLA optimization. Consequently, the fundamental question of what makes for good action tokenizers remains unanswered. In this paper, we bridge this gap by establishing design principles specifically from the perspective of VLA optimization. We identify a set of best practices based on information-theoretic insights, including maximized temporal token overlap, minimized vocabulary redundancy, enhanced multimodal mutual information, and token independence. Guided by these principles, we introduce ActionCodec, a high-performance action…
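Two of the principles named in the abstract, temporal token overlap and vocabulary redundancy, can be made concrete with simple proxies. The sketch below is illustrative only: the uniform-binning tokenizer `bin_tokenize` is a stand-in and not ActionCodec, and `temporal_overlap` and `normalized_entropy` are hypothetical helper names whose definitions are assumptions chosen for this example, not the metrics used in the paper.

```python
import math
from collections import Counter


def bin_tokenize(actions, low=-1.0, high=1.0, vocab_size=256):
    """Stand-in tokenizer (NOT ActionCodec): uniform binning of scalar actions."""
    tokens = []
    for a in actions:
        clipped = min(max(a, low), high)
        idx = int((clipped - low) / (high - low) * vocab_size)
        tokens.append(min(idx, vocab_size - 1))  # keep a == high inside the vocabulary
    return tokens


def temporal_overlap(tokens_t, tokens_next, shift):
    """Assumed proxy: fraction of matching tokens on the timesteps that two
    consecutive action chunks (offset by `shift` steps) have in common."""
    a = tokens_t[shift:]
    b = tokens_next[:len(tokens_t) - shift]
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def normalized_entropy(tokens, vocab_size):
    """Assumed proxy for vocabulary usage: token entropy divided by log(vocab_size).
    Values near 1 mean the vocabulary is used evenly; values near 0 mean
    most of the vocabulary goes unused."""
    total = len(tokens)
    counts = Counter(tokens)
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(vocab_size)


# Toy 1-D trajectory and two chunks of length 8 that are one step apart.
trajectory = [math.sin(0.1 * t) for t in range(20)]
chunk_a = bin_tokenize(trajectory[0:8])
chunk_b = bin_tokenize(trajectory[1:9])

overlap = temporal_overlap(chunk_a, chunk_b, shift=1)
usage = normalized_entropy(bin_tokenize(trajectory), vocab_size=256)
print(overlap, usage)
```

Per-timestep binning trivially gives full overlap on shared timesteps, because each token depends on a single action value. A tokenizer that compresses a whole chunk into a few tokens has no such guarantee, which is one way to see why overlap is worth treating as a design target.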

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis