VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting

Juyi Lin; Amir Taherin; Arash Akbari; Arman Akbari; Lei Lu; Guangyu Chen; Taskin Padir; Xiaomeng Yang; Weiwei Chen; Yiqian Li; Xue Lin; David Kaeli; Pu Zhao; and Yanzhi Wang

arXiv:2507.05116 · cs.CV · October 6, 2025

Open Access · 1 Repo · 3 Models

TL;DR

This paper introduces VOTE, a framework that cuts inference latency and training cost in vision-language-action (VLA) models for robotic manipulation. It fine-tunes the model to generate fewer action tokens and applies a voting-based ensemble over current and previous action predictions, yielding higher success rates and faster inference.

Contribution

The paper presents a training and inference strategy for VLA models: a fine-tuning framework that generates significantly fewer action tokens with high parallelism, plus an inference-time voting ensemble that combines current and previous action predictions.

Findings

01. Achieves 46 Hz throughput on edge platforms.
02. 39× faster inference than OpenVLA.
03. Higher success rates in robotic tasks.

Abstract

Recent large-scale Vision-Language-Action (VLA) models have shown superior performance in robotic manipulation tasks guided by natural language. However, current VLA models suffer from two drawbacks: (i) generation of a massive number of tokens, leading to high inference latency and increased training cost, and (ii) insufficient utilization of generated actions, resulting in potential performance loss. To address these issues, we develop a training framework to fine-tune VLA models to generate significantly fewer action tokens with high parallelism, effectively reducing inference latency and training cost. Furthermore, we introduce an inference optimization technique with a novel voting-based ensemble strategy to combine current and previous action predictions, improving the utilization of generated actions and overall performance. Our results demonstrate that we achieve superior performance…
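
The abstract describes a voting-based ensemble that combines current and previous action predictions but does not spell out the rule. The sketch below is one plausible reading, not the authors' implementation. It assumes that several inference steps have each predicted an action for the same timestep, that agreement is measured by cosine similarity against the current prediction, and that the larger group is averaged. The function names, the `threshold` parameter, and the tie-breaking rule are all hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def vote_ensemble(
    current: Sequence[float],
    previous: Sequence[Sequence[float]],
    threshold: float = 0.5,  # hypothetical agreement cutoff, not from the paper
) -> List[float]:
    """Combine the current action with earlier predictions for the same timestep.

    Assumed rule: earlier predictions whose cosine similarity with the current
    prediction reaches `threshold` join the "agree" group, the rest form the
    "disagree" group. The larger group wins and its members are averaged.
    Ties go to the group containing the current prediction.
    """
    agree = [list(current)]
    disagree = []
    for candidate in previous:
        if cosine_similarity(current, candidate) >= threshold:
            agree.append(list(candidate))
        else:
            disagree.append(list(candidate))

    winners = agree if len(agree) >= len(disagree) else disagree
    dim = len(current)
    return [sum(action[i] for action in winners) / len(winners) for i in range(dim)]


if __name__ == "__main__":
    # Two of the three predictions point in roughly the same direction,
    # so the outlier is discarded and the remaining two are averaged.
    print(vote_ensemble([1.0, 0.0], [[0.8, 0.2], [-1.0, 0.0]]))
```

Under these assumptions, predictions that would otherwise be discarded once a newer one arrives still contribute to the executed action, which is the kind of improved "utilization of generated actions" the abstract refers to.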

Code & Models

Repositories

LukeLIN-web/VOTE (PyTorch, official)

Models

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Reinforcement Learning in Robotics

Methods: ADaptive gradient method with the OPTimal convergence rate · Diffusion