Mean-Flow based One-Step Vision-Language-Action
Yang Chen, Xiaoguang Ma, Bin Zhao

TL;DR
This paper introduces a Mean-Flow based one-step approach for vision-language-action tasks that significantly reduces generation latency, enabling faster robotic manipulation without sacrificing performance.
Contribution
It proposes a Mean-Flow based method that resolves noise-induced issues in action generation and removes the consistency constraints of conventional Flow-Matching, enabling one-step, low-latency VLA action generation for robotic tasks.
Findings
Generation speed is 8.7 times faster than SmolVLA.
Generation speed is 83.9 times faster than Diffusion Policy.
The approach is effective in real-world robotic experiments.
Abstract
Recent advances in Flow-Matching-based Vision-Language-Action (VLA) frameworks have demonstrated remarkable advantages in generating high-frequency action chunks, particularly for highly dexterous robotic manipulation tasks. Despite these notable achievements, their practical applications are constrained by prolonged generation latency, which stems from inherent iterative sampling requirements and architectural limitations. To address this critical bottleneck, we propose a Mean-Flow based One-Step VLA approach. Specifically, we resolve the noise-induced issues in the action generation process, thereby eliminating the consistency constraints inherent to conventional Flow-Matching methods. This significantly enhances generation efficiency and enables one-step action generation. Real-world robotic experiments show that the generation speed of the proposed Mean-Flow based One-Step VLA is 8.7…
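The latency argument in the abstract can be illustrated with a toy sketch. It is not the paper's implementation: the two field functions below are hypothetical analytic stand-ins for trained networks, the single-target "dataset" is invented for illustration, and the one-step rule z_r = z_t - (t - r) * u(z_t, r, t) is the commonly used Mean-Flow sampling form, assumed here rather than taken from this summary. The only point is that iterative Flow-Matching sampling calls the model once per integration step, while a mean (average) velocity field needs a single call.

```python
import numpy as np

# Hypothetical target "action"; the toy data distribution is this single point.
TARGET = np.array([0.5, -0.2, 0.1])


class CountingField:
    """Wraps a field function and counts how many times it is evaluated."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        return self.fn(*args)


def instantaneous_velocity(z, t):
    # Stand-in for a Flow-Matching network v(z, t) on the straight path
    # z_t = (1 - t) * TARGET + t * noise (t = 1 is pure noise, t = 0 is data).
    return (z - TARGET) / t


def average_velocity(z, r, t):
    # Stand-in for a Mean-Flow network u(z, r, t), the mean of v over [r, t].
    # Paths are straight in this toy, so it has the same closed form as v.
    return (z - TARGET) / t


def sample_iterative(v, noise, num_steps):
    """Euler integration from t = 1 down to t = 0: one model call per step."""
    z = noise.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = (num_steps - i) / num_steps
        z = z - dt * v(z, t)
    return z


def sample_one_step(u, noise):
    """Jump from t = 1 to r = 0 with a single call to the average velocity."""
    r, t = 0.0, 1.0
    return noise - (t - r) * u(noise, r, t)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise = rng.standard_normal(TARGET.shape)
    v = CountingField(instantaneous_velocity)
    u = CountingField(average_velocity)
    print("iterative:", sample_iterative(v, noise, 10), "calls:", v.calls)
    print("one-step: ", sample_one_step(u, noise), "calls:", u.calls)
```

In this toy both samplers reach the same target, so the difference is purely the number of model evaluations. With a real VLA backbone each evaluation is a full forward pass, which is where the latency gap would come from.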
Taxonomy
Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics
