SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration

Ye Li; Yuan Meng; Zewen Sun; Kangye Ji; Chen Tang; Jiajun Fan; Xinzhu Ma; Shutao Xia; Zhi Wang; Wenwu Zhu

arXiv:2506.12723·cs.CV·October 6, 2025

Open Access 3 Reviews

TL;DR

This paper introduces SP-VLA, a unified framework that accelerates Vision-Language-Action models by jointly scheduling models and pruning tokens, addressing temporal and spatial redundancies for real-time applications.

Contribution

The paper proposes a novel joint model scheduling and token pruning approach tailored for VLA models, incorporating action-aware scheduling and dual-aware token pruning for efficiency.

Findings

1. Achieves 1.5× lossless acceleration in LIBERO
2. Achieves 2.4× lossless acceleration in SimplerEnv
3. Improves inference frequency and latency significantly

Abstract

Vision-Language-Action (VLA) models have attracted increasing attention for their strong control capabilities. However, their high computational cost and low execution frequency hinder their suitability for real-time tasks such as robotic manipulation and autonomous navigation. Existing VLA acceleration methods primarily focus on structural optimization, overlooking the fact that these models operate in sequential decision-making environments. As a result, temporal redundancy in sequential action generation and spatial redundancy in visual input remain unaddressed. To this end, we propose SP-VLA, a unified framework that accelerates VLA models by jointly scheduling models and pruning tokens. Specifically, we design an action-aware model scheduling mechanism that reduces temporal redundancy by dynamically switching between the VLA model and a lightweight generator. Inspired by the human…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The paper presents a simple method to speed up VLA inference, based on the observation that most actions predicted by the VLA do not require the model's full processing power (intuitive actions) and can instead be estimated with a very simple ridge regression.
2. The method shows a significant improvement in VLA run-time on 2 simulated benchmarks.
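
To make the ridge-regression idea in point 1 concrete, here is a minimal sketch, not the authors' implementation. It assumes each action dimension is fitted as a straight line over the time index of a short buffer of recent actions, and that the next action is obtained by extrapolating one step ahead. The names `ridge_fit_line` and `extrapolate_action` and the penalty `lam` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def ridge_fit_line(ts: Sequence[float], ys: Sequence[float], lam: float = 1e-3) -> Tuple[float, float]:
    """Fit y ~ w*t + b by minimising squared error + lam*(w^2 + b^2).

    Solves the 2x2 normal equations (X^T X + lam*I) theta = X^T y in closed form.
    lam must be > 0 so the system stays invertible even for a single sample.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    s_tt = sum(t * t for t in ts) + lam
    s_t = sum(ts)
    s_1 = len(ts) + lam
    s_ty = sum(t * y for t, y in zip(ts, ys))
    s_y = sum(ys)
    det = s_tt * s_1 - s_t * s_t
    w = (s_1 * s_ty - s_t * s_y) / det
    b = (s_tt * s_y - s_t * s_ty) / det
    return w, b


def extrapolate_action(buffer: Sequence[Sequence[float]], lam: float = 1e-3) -> List[float]:
    """Predict the next action from a buffer of recent actions (oldest first).

    Assumption: each action dimension is fitted independently over t = 0..n-1
    and evaluated at t = n.
    """
    n = len(buffer)
    ts = list(range(n))
    prediction = []
    for d in range(len(buffer[0])):
        ys = [action[d] for action in buffer]
        w, b = ridge_fit_line(ts, ys, lam)
        prediction.append(w * n + b)
    return prediction
```

For a buffer that moves at constant velocity in one dimension, the extrapolated action simply continues that motion, which is the near-linear short-horizon case where such a cheap estimator is plausible.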

Weaknesses

***1. How are hyper-parameters chosen?*** My biggest concern with this paper is that the performance of both model scheduling and token pruning relies heavily on the hyper-parameters, and there is not much detail on how they are chosen. Are these simulator/task-dependent? If so, do you select them on a subset of data and then evaluate on a larger set to see if they transfer? How can these parameters be selected for real-world tasks where evaluation is not cheap to run and each task m

Reviewer 02 · Rating 6 · Confidence 3

Strengths

* **Perception token pruning**: the spatio-semantic dual-aware token pruning strategy is well-motivated and empirically validated. By combining semantic attention with spatial cues, the method preserves critical spatial information and achieves faster inference without significant accuracy loss.
* **Comprehensive experiments**: The paper provides extensive experiments on standard benchmarks (LIBERO, SimplerEnv), including ablation studies and comparisons to relevant baselines.
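
One way to picture "combining semantic attention with spatial cues" is the sketch below. It is an illustration under stated assumptions rather than the paper's algorithm: each visual token is assumed to carry a semantic score (for example, attention received from the text tokens) and a spatial score (for example, edge strength in its image patch). The function `prune_tokens` and its parameters `keep_semantic` and `spatial_threshold` are hypothetical.

```python
from typing import List, Sequence


def prune_tokens(
    semantic: Sequence[float],
    spatial: Sequence[float],
    keep_semantic: int,
    spatial_threshold: float,
) -> List[int]:
    """Return the indices of visual tokens to keep, in their original order.

    A token survives if it is among the `keep_semantic` highest semantic
    scores OR its spatial score reaches `spatial_threshold`.
    """
    if len(semantic) != len(spatial):
        raise ValueError("semantic and spatial scores must have the same length")
    # Indices of the tokens with the highest semantic scores.
    by_semantic = sorted(range(len(semantic)), key=lambda i: semantic[i], reverse=True)
    keep = set(by_semantic[:keep_semantic])
    # Add tokens that carry strong spatial structure regardless of semantics.
    keep.update(i for i, s in enumerate(spatial) if s >= spatial_threshold)
    # Sorting restores the original token order, so positional layout is preserved.
    return sorted(keep)
```

Taking the union of the two criteria and returning indices in their original order is a deliberate choice in this sketch: it keeps tokens that a purely semantic ranking would drop while leaving the relative positions of the surviving tokens intact.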

Weaknesses

* **Loosely-connected contributions**: The paper presents two main ideas (action scheduling and token pruning) that are only loosely connected, apart from the common goal.
* **Limited generality**: the introduced novelties are "simpler" than the VLA model and they introduce a new set of hyperparameters, which hinders the generality of the approach. This limitation particularly affects the action part, where different embodiments may have very different action spaces and thus, defining hyperparameter

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. Interesting idea that targets time + space waste; easy to plug into different VLA stacks.
2. Solid gains with small or no accuracy drop; ablations back up the design.
3. Explains why semantics-only pruning breaks VLA (loses spatial order).

Weaknesses

1. Relies on a few heuristic knobs (speed window, buffer length, gating τ); some sensitivity.
2. The lightweight head assumes near-linear short-horizon motion; can fail under contact/perturbations.
3. Canny edge detection can be fragile in the presence of lighting and material noise.

Additionally, there is limited evidence from real-robot experiments and little information on tail latency and energy consumption.
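
The knobs named in point 1 can be pictured with a small gating sketch. This is a hypothetical illustration, not the paper's rule. It assumes that the gate compares the mean action magnitude over a recent window against a threshold τ, that slow motion is treated as needing the full VLA model, and that the full model is always used until the buffer holds enough actions. The names `mean_speed`, `use_full_model`, `window`, `tau` and `min_buffer` are all assumptions.

```python
import math
from typing import Sequence


def mean_speed(actions: Sequence[Sequence[float]], window: int) -> float:
    """Average Euclidean norm of the last `window` actions (assumed to be displacements)."""
    recent = actions[-window:]
    return sum(math.sqrt(sum(x * x for x in a)) for a in recent) / len(recent)


def use_full_model(
    actions: Sequence[Sequence[float]],
    window: int,
    tau: float,
    min_buffer: int,
) -> bool:
    """Decide whether the next step should call the full VLA model.

    Assumed rule: fall back to the full model while the buffer is too short,
    and otherwise call it only when recent motion is slower than `tau`.
    """
    if len(actions) < min_buffer:
        return True
    return mean_speed(actions, window) < tau
```

Even in this toy form, the outcome depends on three coupled values (`window`, `min_buffer`, `tau`), which illustrates why the sensitivity to such settings is raised as a concern.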

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Robot Manipulation and Learning