FAST: Efficient Action Tokenization for Vision-Language-Action Models

Karl Pertsch; Kyle Stachowicz; Brian Ichter; Danny Driess; Suraj Nair; Quan Vuong; Oier Mees; Chelsea Finn; Sergey Levine

arXiv:2501.09747·cs.RO·January 17, 2025

TL;DR

This paper introduces FAST, a novel frequency-based action tokenization method for vision-language-action models, enabling efficient learning of dexterous, high-frequency robotic behaviors and significantly reducing training time.

Contribution

We propose FAST, a compression-based tokenization scheme built on the discrete cosine transform, and release FAST+ as a universal robot action tokenizer trained on extensive real robot data.
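To make the idea of DCT-based compression concrete, here is a minimal sketch, not the authors' implementation. It transforms one action dimension of a chunk with an orthonormal DCT, rounds the scaled coefficients to integers, and inverts the process. The function names, the `scale` value, and the 50-step sine trajectory are all assumptions chosen for illustration; any further compression of the resulting integers is left out.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1-D sequence (pure Python, O(n^2))."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(norm * s)
    return out


def idct(c):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for t in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        for k in range(1, n):
            s += c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (t + 0.5) * k / n)
        out.append(s)
    return out


def encode(chunk, scale=10.0):
    """Hypothetical encoder: scale DCT coefficients and round to integers."""
    return [round(scale * c) for c in dct(chunk)]


def decode(tokens, scale=10.0):
    """Hypothetical decoder: undo the scaling and apply the inverse DCT."""
    return idct([q / scale for q in tokens])


# A smooth 50-step trajectory for a single action dimension (made-up data).
chunk = [math.sin(2 * math.pi * t / 50) for t in range(50)]
tokens = encode(chunk)
recon = decode(tokens)

nonzero = sum(1 for q in tokens if q != 0)
max_err = max(abs(a - b) for a, b in zip(chunk, recon))
print(f"{nonzero} of {len(tokens)} coefficients are non-zero, max error {max_err:.3f}")
```

For a smooth signal like this one, most rounded coefficients come out as zero, which is what makes the frequency-domain representation compressible. A larger `scale` keeps more coefficients and lowers the reconstruction error.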

Findings

01. FAST enables training on high-frequency, dexterous tasks where standard methods fail.

02. Using FAST+ improves training efficiency, reducing training time by up to 5x.

03. Our approach matches diffusion VLA performance on large-scale robot data.

Abstract

Autoregressive sequence models, such as Transformer-based vision-language-action (VLA) policies, can be tremendously effective for capturing complex and generalizable robotic behaviors. However, such models require us to choose a tokenization of our continuous action signals, which determines how the discrete symbols predicted by the model map to continuous robot actions. We find that current approaches for robot action tokenization, based on simple per-dimension, per-timestep binning schemes, typically perform poorly when learning dexterous skills from high-frequency robot data. To address this challenge, we propose a new compression-based tokenization scheme for robot actions, based on the discrete cosine transform. Our tokenization approach, Frequency-space Action Sequence Tokenization (FAST), enables us to train autoregressive VLAs for highly dexterous and high-frequency tasks where…
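The per-dimension, per-timestep binning baseline that the abstract refers to can be sketched as follows. This is an illustrative reading of the description, not code from the paper: the value range of [-1, 1], the 256 bins, the function names, and the toy data are all assumptions.

```python
def bin_tokenize(chunk, low=-1.0, high=1.0, n_bins=256):
    """Map every value of every timestep to its own uniform-bin index."""
    tokens = []
    for step in chunk:
        for value in step:
            clipped = min(max(value, low), high)
            idx = int((clipped - low) / (high - low) * n_bins)
            tokens.append(min(idx, n_bins - 1))  # value == high falls in the last bin
    return tokens


def bin_detokenize(tokens, dims, low=-1.0, high=1.0, n_bins=256):
    """Map bin indices back to bin centers and regroup them by timestep."""
    width = (high - low) / n_bins
    values = [low + (t + 0.5) * width for t in tokens]
    return [values[i:i + dims] for i in range(0, len(values), dims)]


# Three consecutive timesteps of a 2-dimensional action (made-up data).
chunk = [[0.0, 0.5], [0.01, 0.49], [0.02, 0.48]]
tokens = bin_tokenize(chunk)
recon = bin_detokenize(tokens, dims=2)
print(tokens)
```

The token count is always timesteps times dimensions, so it grows linearly with the control frequency, and at high frequencies consecutive tokens for the same dimension become nearly identical because the underlying values barely change between steps.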

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Neural Network Applications

Methods: Diffusion