VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting
Juyi Lin, Amir Taherin, Arash Akbari, Arman Akbari, Lei Lu, Guangyu Chen, Taskin Padir, Xiaomeng Yang, Weiwei Chen, Yiqian Li, Xue Lin, David Kaeli, Pu Zhao, and Yanzhi Wang

TL;DR
This paper introduces VOTE, a framework that reduces inference latency and training cost in vision-language-action (VLA) models for robotic manipulation. It fine-tunes the model to generate significantly fewer action tokens in parallel and, at inference time, uses a voting-based ensemble to combine current and previous action predictions, which leads to higher success rates and faster inference.
Contribution
The paper presents a novel training and inference strategy for VLA models that significantly improves efficiency and performance through token reduction and ensemble voting.
Findings
Achieves 46 Hz throughput on edge platforms.
39× faster inference than OpenVLA.
Higher success rates in robotic tasks.
Abstract
Recent large-scale Vision Language Action (VLA) models have shown superior performance in robotic manipulation tasks guided by natural language. However, current VLA models suffer from two drawbacks: (i) generation of massive tokens leading to high inference latency and increased training cost, and (ii) insufficient utilization of generated actions resulting in potential performance loss. To address these issues, we develop a training framework to finetune VLA models for generating significantly fewer action tokens with high parallelism, effectively reducing inference latency and training cost. Furthermore, we introduce an inference optimization technique with a novel voting-based ensemble strategy to combine current and previous action predictions, improving the utilization of generated actions and overall performance. Our results demonstrate that we achieve superior performance…
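The abstract describes the voting-based ensemble only at a high level: current and previous action predictions for the same timestep are combined. A minimal sketch of one way such a vote could work is given below. The cosine-similarity test, the `threshold` value, the majority rule, and the names `cosine` and `vote_ensemble` are all assumptions made for illustration, not details confirmed by the text above.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def vote_ensemble(current, previous, threshold=0.5):
    """Combine the current action prediction with earlier predictions
    made for the same timestep.

    Hypothetical rule (an assumption, not the paper's stated method):
    each candidate "agrees" with the current prediction if its cosine
    similarity exceeds `threshold`; the larger group wins the vote and
    its members are averaged element-wise.
    """
    candidates = [current] + list(previous)
    agree = [c for c in candidates if cosine(c, current) > threshold]
    disagree = [c for c in candidates if cosine(c, current) <= threshold]
    # Ties go to the group that contains the current prediction.
    chosen = agree if len(agree) >= len(disagree) else disagree
    dim = len(current)
    return [sum(c[i] for c in chosen) / len(chosen) for i in range(dim)]


if __name__ == "__main__":
    # Two of the three candidates point roughly the same way, so the
    # outlier is dropped and the remaining two are averaged.
    action = vote_ensemble([1.0, 0.0], [[0.9, 0.1], [-1.0, 0.0]])
    print(action)
```

In this sketch the outlier prediction is excluded rather than averaged in, so a single stale or inconsistent earlier prediction does not pull the executed action off course. If most earlier predictions disagree with the current one, the majority overrides it.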
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Reinforcement Learning in Robotics
Methods: ADaptive gradient method with the OPTimal convergence rate · Diffusion
