Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment
Kaijun Zhou, Qiwei Chen, Da Peng, Zhiyang Li, Xijun Li, Jinyu Gu

TL;DR
This paper systematically characterizes Vision-Language-Action (VLA) models across heterogeneous edge accelerators (GPUs, XPUs, NPUs), identifies their inference bottlenecks, and proposes methods that accelerate inference for real-time robot control with minimal accuracy loss.
Contribution
It introduces a cross-accelerator leaderboard that evaluates model-hardware pairs under CET (Cost, Energy, Time), uncovers a consistent two-phase inference pattern, and proposes DP-Cache and V-AEFusion to speed up VLA inference on edge devices.
Findings
Right-sized edge devices can be more cost- and energy-efficient than flagship GPUs while still meeting control-rate constraints.
Inference follows a consistent two-phase pattern: a compute-bound VLM backbone followed by a memory-bound Action Expert.
Proposed methods achieve up to 2.9x GPU speedup and 6x edge NPU speedup.
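To make the compute-bound versus memory-bound distinction concrete, here is a minimal roofline-style sketch. It is not the paper's profiling method: the device specification and the per-phase FLOP and byte counts are hypothetical placeholders chosen only for illustration, and the names `Device`, `classify`, and `attainable_flops` are invented for this example. A phase is treated as compute-bound when its arithmetic intensity (FLOPs per byte moved) is at or above the device's ridge point (peak FLOP/s divided by memory bandwidth).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical accelerator description (values are illustrative only)."""
    name: str
    peak_flops: float     # peak compute throughput, FLOP/s
    mem_bandwidth: float  # peak memory bandwidth, bytes/s

    @property
    def ridge_point(self) -> float:
        # Arithmetic intensity (FLOP/byte) at which the roofline bends.
        return self.peak_flops / self.mem_bandwidth


def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte transferred to or from memory."""
    return flops / bytes_moved


def classify(flops: float, bytes_moved: float, device: Device) -> str:
    """Label a phase as compute-bound or memory-bound on the given device."""
    ai = arithmetic_intensity(flops, bytes_moved)
    return "compute-bound" if ai >= device.ridge_point else "memory-bound"


def attainable_flops(flops: float, bytes_moved: float, device: Device) -> float:
    """Roofline upper bound on achievable throughput for a phase, in FLOP/s."""
    ai = arithmetic_intensity(flops, bytes_moved)
    return min(device.peak_flops, ai * device.mem_bandwidth)


# Hypothetical edge accelerator: 20 TFLOP/s, 100 GB/s -> ridge point of 200 FLOP/byte.
edge = Device("hypothetical-edge-npu", peak_flops=20e12, mem_bandwidth=100e9)

# Hypothetical per-inference workloads as (FLOPs, bytes moved); not measured values.
backbone_phase = (4e12, 8e9)  # 500 FLOP/byte
action_phase = (2e10, 2e9)    # 10 FLOP/byte

if __name__ == "__main__":
    for label, (flops, nbytes) in [("VLM backbone", backbone_phase),
                                   ("Action Expert", action_phase)]:
        bound = classify(flops, nbytes, edge)
        ceiling = attainable_flops(flops, nbytes, edge)
        print(f"{label}: {bound}, roofline ceiling {ceiling / 1e12:.1f} TFLOP/s")
```

With these placeholder numbers, the backbone-like phase sits above the ridge point and can in principle reach peak compute, while the Action-Expert-like phase is capped by memory bandwidth at a small fraction of peak. That gap is the kind of phase-dependent underutilization the findings refer to.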
Abstract
Vision-Language-Action (VLA) models are promising for generalist robot control, but on-robot deployment is bottlenecked by real-time inference under tight cost and energy budgets. Most prior evaluations rely on desktop-grade GPUs, obscuring the trade-offs and opportunities offered by heterogeneous edge accelerators (GPUs/XPUs/NPUs). We present a systematic analysis for low-cost VLA deployment via model-hardware co-characterization. First, we build a cross-accelerator leaderboard and evaluate model-hardware pairs under CET (Cost, Energy, Time), showing that right-sized edge devices can be more cost-/energy-efficient than flagship GPUs while meeting control-rate constraints. Second, using in-depth profiling, we uncover a consistent two-phase inference pattern: a compute-bound VLM backbone followed by a memory-bound Action Expert, which induces phase-dependent underutilization and hardware…
