Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

Kaijun Zhou; Qiwei Chen; Da Peng; Zhiyang Li; Xijun Li; Jinyu Gu

arXiv:2604.24447 · cs.RO · April 28, 2026


TL;DR

This paper systematically characterizes vision-language-action (VLA) models across heterogeneous edge accelerators (GPUs, XPUs, NPUs), identifies their inference bottlenecks, and proposes methods that accelerate inference for real-time robot control with minimal accuracy loss.

Contribution

The paper introduces a cross-accelerator leaderboard, identifies a two-phase inference pattern through profiling, and proposes DP-Cache and V-AEFusion to speed up VLA inference on edge devices.

Findings

01. Edge devices can be more cost- and energy-efficient than flagship GPUs.

02. Inference involves a compute-bound backbone and a memory-bound Action Expert.

03. Proposed methods achieve up to 2.9x GPU speedup and 6x edge NPU speedup.

Abstract

Vision-Language-Action (VLA) models are promising for generalist robot control, but on-robot deployment is bottlenecked by real-time inference under tight cost and energy budgets. Most prior evaluations rely on desktop-grade GPUs, obscuring the trade-offs and opportunities offered by heterogeneous edge accelerators (GPUs/XPUs/NPUs). We present a systematic analysis for low-cost VLA deployment via model-hardware co-characterization. First, we build a cross-accelerator leaderboard and evaluate model-hardware pairs under CET (Cost, Energy, Time), showing that right-sized edge devices can be more cost-/energy-efficient than flagship GPUs while meeting control-rate constraints. Second, using in-depth profiling, we uncover a consistent two-phase inference pattern: a compute-bound VLM backbone followed by a memory-bound Action Expert, which induces phase-dependent underutilization and hardware…
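
The compute-bound versus memory-bound distinction in the abstract can be illustrated with a roofline-style check. The following is a minimal sketch, not the paper's profiling method: the function names and every number (peak throughput, memory bandwidth, FLOP and byte counts) are hypothetical placeholders chosen only to show the arithmetic.

```python
def ridge_point(peak_flops: float, peak_bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which a device moves from
    memory-bound to compute-bound in the roofline model."""
    return peak_flops / peak_bandwidth


def classify_phase(flops: float, bytes_moved: float,
                   peak_flops: float, peak_bandwidth: float) -> str:
    """Label a workload phase by comparing its arithmetic intensity
    against the device ridge point."""
    intensity = flops / bytes_moved
    if intensity >= ridge_point(peak_flops, peak_bandwidth):
        return "compute-bound"
    return "memory-bound"


# Hypothetical edge accelerator: 100 TFLOP/s peak, 200 GB/s memory bandwidth.
PEAK_FLOPS = 100e12
PEAK_BANDWIDTH = 200e9

# Hypothetical phase costs (assumed values, not measurements from the paper).
backbone = classify_phase(flops=2e12, bytes_moved=2e9,
                          peak_flops=PEAK_FLOPS, peak_bandwidth=PEAK_BANDWIDTH)
action_expert = classify_phase(flops=1e10, bytes_moved=1e9,
                               peak_flops=PEAK_FLOPS, peak_bandwidth=PEAK_BANDWIDTH)

print("VLM backbone:", backbone)
print("Action Expert:", action_expert)
```

Under these assumed numbers the backbone phase sits above the ridge point and the Action Expert phase below it, which is the kind of phase-dependent behavior the abstract describes. Real classifications depend on measured FLOP counts, memory traffic, and the actual device specifications.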
