EdgeVLA: Efficient Vision-Language-Action Models
Paweł Budzianowski, Wesley Maa, Matthew Freed, Jingxiang Mo, Winston Hsiao, Aaron Xie, Tomasz Młoduchowski, Viraj Tipnis, Benjamin Bolte

TL;DR
EdgeVLA (EVLA) is a fast, resource-efficient vision-language-action model for robotics that targets real-time inference on edge devices while keeping training performance comparable to larger models.
Contribution
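The paper presents EVLA, an approach that speeds up VLA inference by 7x by removing the autoregressive requirement for end-effector position prediction, and that lowers computational demands by using Small Language Models (SLMs) while reporting comparable training performance.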
Findings
7x inference speedup on edge devices
Comparable training performance to larger models
Significant reduction in memory usage
Abstract
Vision-Language Models (VLMs) have emerged as a promising approach to address the data scarcity challenge in robotics, enabling the development of generalizable visuomotor control policies. While models like OpenVLA showcase the potential of this paradigm, deploying large-scale VLMs on resource-constrained mobile manipulation systems remains a significant hurdle. This paper introduces Edge VLA (EVLA), a novel approach designed to significantly enhance the inference speed of Vision-Language-Action (VLA) models. EVLA maintains the representational power of these models while enabling real-time performance on edge devices. We achieve this through two key innovations: 1) Eliminating the autoregressive requirement for end-effector position prediction, leading to a 7x speedup in inference, and 2) Leveraging the efficiency of Small Language Models (SLMs), demonstrating comparable training…
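The first innovation, dropping autoregressive decoding for the end-effector prediction, can be illustrated with a toy sketch. It is not the authors' implementation: `CountingBackbone`, `autoregressive_decode`, `joint_decode`, and the 7-dimensional action (`ACTION_DIMS`) are hypothetical names and assumptions chosen for illustration. The sketch only counts forward passes through a stand-in model; it does not measure latency or reproduce the reported speedup.

```python
from typing import List, Sequence

# Assumption for illustration: a 7-dimensional action
# (e.g. a 6-DoF end-effector delta plus a gripper command).
ACTION_DIMS = 7


class CountingBackbone:
    """Toy stand-in for a VLA backbone that counts its forward passes."""

    def __init__(self) -> None:
        self.calls = 0

    def forward(self, context: Sequence[float], n_outputs: int) -> List[float]:
        # One call models one full pass through the network.
        self.calls += 1
        seed = sum(context) + len(context)
        # Deterministic placeholder outputs in [0, 1); not a real policy.
        return [((seed + i) * 0.1) % 1.0 for i in range(n_outputs)]


def autoregressive_decode(backbone: CountingBackbone, obs: Sequence[float]) -> List[float]:
    """Predict one action dimension per pass, feeding each output back in."""
    context = list(obs)
    action: List[float] = []
    for _ in range(ACTION_DIMS):
        value = backbone.forward(context, 1)[0]
        action.append(value)
        context.append(value)  # the next pass conditions on this output
    return action


def joint_decode(backbone: CountingBackbone, obs: Sequence[float]) -> List[float]:
    """Predict all action dimensions in a single pass."""
    return backbone.forward(list(obs), ACTION_DIMS)


if __name__ == "__main__":
    observation = [0.1, 0.2, 0.3]  # placeholder fused vision-language features

    ar_model = CountingBackbone()
    autoregressive_decode(ar_model, observation)

    joint_model = CountingBackbone()
    joint_decode(joint_model, observation)

    print("autoregressive passes:", ar_model.calls)
    print("joint passes:", joint_model.calls)
```

In this toy setting the autoregressive path needs one pass per action dimension and the joint path needs one pass in total. The ratio depends entirely on the assumed action dimensionality, so it should be read as an intuition for where the savings come from, not as a benchmark.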
Taxonomy
Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics
