TL;DR
DFM-VLA introduces an iterative discrete flow matching approach for robotic manipulation, enabling dynamic action token refinement and outperforming existing decoding methods in accuracy and efficiency.
Contribution
It proposes a novel discrete flow matching framework for iterative action token refinement in vision-language-action models for robotics.
Findings
DFM-VLA outperforms autoregressive and diffusion baselines in manipulation tasks.
Achieves a 95.7% success rate on the LIBERO benchmark.
Attains an average success length of 4.44 on CALVIN.
Abstract
Vision-Language-Action (VLA) models that encode actions using a discrete tokenization scheme are increasingly adopted for robotic manipulation, but existing decoding paradigms remain fundamentally limited. Whether actions are decoded sequentially by autoregressive VLAs or in parallel by discrete diffusion VLAs, once a token is generated, it is typically fixed and cannot be revised in subsequent iterations, so early token errors cannot be effectively corrected later. We propose DFM-VLA, a discrete flow matching VLA for iterative refinement of action tokens. DFM-VLA models a token-level probability velocity field that dynamically updates the full action sequence across refinement iterations. We investigate two ways to construct the velocity field: an auxiliary velocity-head formulation and an action-embedding-guided formulation. Our framework further adopts a two-stage decoding strategy…
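To make the idea of revisable tokens concrete, here is a minimal sketch of a generic discrete flow matching Euler sampler. It is not the paper's implementation: it assumes a linear schedule (kappa_t = t) and a uniform-noise starting sequence, and it does not model either velocity-field construction or the two-stage decoding strategy. The names `euler_step`, `refine`, `toy_posterior`, and `TARGET` are hypothetical, and `toy_posterior` stands in for a learned network that predicts the clean token at each position.

```python
import random

def euler_step(tokens, posterior, t, h, rng):
    """Advance every position by one Euler step of size h at time t (kappa_t = t)."""
    # Assumed linear schedule: the per-position jump rate is 1 / (1 - t),
    # so the jump probability for this step is h / (1 - t), capped at 1.
    jump_prob = min(1.0, h / (1.0 - t))
    updated = []
    for i, tok in enumerate(tokens):
        if rng.random() < jump_prob:
            # Resample this position from the model's guess of the clean token.
            probs = posterior(tokens, i, t)
            tok = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        updated.append(tok)
    return updated

def refine(posterior, seq_len, vocab_size, num_steps, seed=0):
    """Start from uniform-noise tokens and refine the whole sequence step by step."""
    rng = random.Random(seed)
    tokens = [rng.randrange(vocab_size) for _ in range(seq_len)]
    h = 1.0 / num_steps
    for k in range(num_steps):
        # No position is ever frozen: each one may be revised at every step.
        tokens = euler_step(tokens, posterior, k * h, h, rng)
    return tokens

TARGET = [3, 1, 4, 1, 5]  # hypothetical "correct" action tokens

def toy_posterior(tokens, position, t):
    # Hypothetical stand-in for a learned model: all mass on the target token.
    return [1.0 if v == TARGET[position] else 0.0 for v in range(8)]

print(refine(toy_posterior, seq_len=5, vocab_size=8, num_steps=10))
```

In this sketch a token sampled early can still be replaced at any later step, in contrast with the fixed-once-generated behavior the abstract attributes to autoregressive and discrete diffusion decoding.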
