ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving

Lin Sha; Haiyun Guo; Tao Wang; Cong Zhang; Min Huang; Jinqiao Wang; Qinghai Miao

arXiv:2604.19145 · cs.CV · April 22, 2026

TL;DR

ST-Prune is a training-free spatio-temporal token pruning framework for vision-language models in autonomous driving. It reduces computation while keeping performance close to that of the unpruned model.

Contribution

The paper proposes a plug-and-play spatio-temporal pruning method that combines a Motion-aware Temporal Pruning (MTP) module with a Ring-view Spatial Pruning (RSP) module, tailored to the multi-view, multi-frame input of autonomous driving.

Findings

01. Achieves near-lossless performance at 90% token reduction.
02. Sets a new state of the art for training-free token pruning in autonomous driving.
03. Maintains inference speed comparable to existing methods.

Abstract

Vision-Language Models (VLMs) have become central to autonomous driving systems, yet their deployment is severely bottlenecked by the massive computational overhead of multi-view camera and multi-frame video input. Existing token pruning methods, primarily designed for single-image inputs, treat each frame or view in isolation and thus fail to exploit the inherent spatio-temporal redundancies in driving scenarios. To bridge this gap, we propose ST-Prune, a training-free, plug-and-play framework comprising two complementary modules: Motion-aware Temporal Pruning (MTP) and Ring-view Spatial Pruning (RSP). MTP addresses temporal redundancy by encoding motion volatility and temporal recency as soft constraints within the diversity selection objective, prioritizing dynamic trajectories and current-frame content over static historical background. RSP further resolves spatial redundancy by…
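To make the MTP idea in the abstract more concrete, here is a minimal sketch of diversity-based token selection in which motion volatility and temporal recency act as soft weights. It is an illustration under stated assumptions, not the authors' implementation: the function name `motion_aware_select`, the multiplicative weighting, the parameters `alpha` and `decay`, and the use of greedy max-min selection over cosine distance are all hypothetical choices, since the abstract does not specify the exact objective.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two feature vectors (1.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def motion_aware_select(tokens, frame_ids, volatility, keep, alpha=1.0, decay=0.8):
    """Greedy weighted max-min selection of `keep` token indices.

    ASSUMPTION: each token's diversity score is scaled by
    (1 + alpha * volatility) * decay ** frame_age, so dynamic and recent
    tokens are preferred without being hard-required. This weighting is a
    hypothetical stand-in for the paper's soft constraints.
    """
    n = len(tokens)
    keep = min(keep, n)
    if keep <= 0:
        return []

    current_frame = max(frame_ids)
    weights = [
        (1.0 + alpha * volatility[i]) * decay ** (current_frame - frame_ids[i])
        for i in range(n)
    ]

    # Seed with the highest-weight token (most dynamic and most recent).
    first = max(range(n), key=lambda i: weights[i])
    selected = [first]
    # min_dist[i] tracks the distance from token i to its nearest selected token.
    min_dist = [cosine_distance(tokens[i], tokens[first]) for i in range(n)]

    while len(selected) < keep:
        chosen = set(selected)
        # Pick the token that is far from the selected set AND highly weighted.
        best = max(
            (i for i in range(n) if i not in chosen),
            key=lambda i: weights[i] * min_dist[i],
        )
        selected.append(best)
        for i in range(n):
            d = cosine_distance(tokens[i], tokens[best])
            if d < min_dist[i]:
                min_dist[i] = d

    return sorted(selected)
```

Under this sketch, a 90% token reduction would correspond to calling the function with `keep` set to roughly one tenth of the token count. How volatility is actually measured, and how RSP handles overlap between camera views, is not covered here.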
