FAST: Efficient Action Tokenization for Vision-Language-Action Models
Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, Sergey Levine

TL;DR
This paper introduces FAST, a novel frequency-based action tokenization method for vision-language-action models, enabling efficient learning of dexterous, high-frequency robotic behaviors and significantly reducing training time.
Contribution
We propose FAST, a compression-based tokenization scheme using discrete cosine transform, and release FAST+ as a universal robot action tokenizer trained on extensive real robot data.
Findings
FAST enables training on high-frequency, dexterous tasks where standard methods fail.
Using FAST+ improves training efficiency, reducing training time by up to 5x.
Our approach matches diffusion VLA performance on large-scale robot data.
Abstract
Autoregressive sequence models, such as Transformer-based vision-language action (VLA) policies, can be tremendously effective for capturing complex and generalizable robotic behaviors. However, such models require us to choose a tokenization of our continuous action signals, which determines how the discrete symbols predicted by the model map to continuous robot actions. We find that current approaches for robot action tokenization, based on simple per-dimension, per-timestep binning schemes, typically perform poorly when learning dexterous skills from high-frequency robot data. To address this challenge, we propose a new compression-based tokenization scheme for robot actions, based on the discrete cosine transform. Our tokenization approach, Frequency-space Action Sequence Tokenization (FAST), enables us to train autoregressive VLAs for highly dexterous and high-frequency tasks where…
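The contrast the abstract draws, between per-timestep binning and a compression-based scheme built on the discrete cosine transform, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `scale` parameter, the scale-and-round quantization step, and the toy sine trajectory are all assumptions made for illustration, and any further compression that a full tokenizer might apply to the quantized coefficients is omitted.

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a 1-D sequence (direct O(n^2) formula)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(norm * s)
    return out

def idct2(c):
    """Inverse of dct2 (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for t in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (t + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out

def bin_tokens(x, lo=-1.0, hi=1.0, bins=256):
    """Baseline: one token per timestep via uniform binning of the raw value."""
    return [min(bins - 1, max(0, int((v - lo) / (hi - lo) * bins))) for v in x]

def dct_tokens(x, scale=10.0):
    """Hypothetical quantizer: scale DCT coefficients and round to integers.
    For smooth signals most high-frequency coefficients round to zero."""
    return [round(c * scale) for c in dct2(x)]

def dct_detokenize(q, scale=10.0):
    """Undo the scaling and invert the DCT to recover an approximate signal."""
    return idct2([v / scale for v in q])

# Toy example (assumed): one action dimension, 50 timesteps of a smooth motion.
chunk = [math.sin(2 * math.pi * t / 50) for t in range(50)]

baseline = bin_tokens(chunk)   # always 50 tokens, one per timestep
q = dct_tokens(chunk)          # 50 integers, most of them zero
nonzero = sum(1 for v in q if v != 0)
recon = dct_detokenize(q)
max_err = max(abs(a - b) for a, b in zip(chunk, recon))

print("binning tokens:", len(baseline))
print("nonzero DCT coefficients:", nonzero)
print("max reconstruction error:", round(max_err, 3))
```

In this sketch the binning baseline always spends one token per timestep, so token count grows with the control frequency, whereas the quantized DCT representation concentrates the information in a few low-frequency coefficients. Lowering `scale` gives coarser quantization and more zeros at the cost of reconstruction error.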
Code & Models
- 🤗 KarlP/fast-droid (model) · ♡ 2
- 🤗 lipsop/so101_pi0fast_100ep (model) · 1 dl
- 🤗 jianqiang03/my_policy (model) · 2 dl
- 🤗 maharishiva/pi0fast_tictactoe09 (model)
- 🤗 observabot/pi0fast_so101_die_mat3_b8_lr1e-4_cs50_nas50_robo (model)
- 🤗 omkarmayekar555/pi0fast_testing-training_19July (model) · 1 dl
- 🤗 observabot/pi0fast_so101_die_mat3_b8_lr1e-4_cs100_nas100_robo (model)
- 🤗 sucrammal/vla-reasoning-v2 (model)
- 🤗 observabot/pi0fast_so101_die_mat4_b24_lr1e-2_cs100_nas100_robo (model) · 2 dl
- 🤗 observabot/pi0fast_so101_cloth_folding1_b24_lr1e-3_cs100_nas100_robo (model) · 1 dl
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Neural Network Applications
Methods: Diffusion
