DepthCache: Depth-Guided Training-Free Visual Token Merging for Vision-Language-Action Model Inference

Yuquan Li; Lianjie Ma; Han Ding; Lijun Zhu

arXiv:2603.10469·cs.RO·March 12, 2026

Open Access

TL;DR

DepthCache is a training-free, depth-guided visual token merging framework that speeds up vision-language-action model inference for robotic manipulation by up to 1.28x with less than 1% degradation in success rate.

Contribution

It introduces a novel depth-based, spatially differentiated token merging method that preserves critical near-field information and exploits temporal redundancy, applicable across diverse VLA models.

Findings

01. Up to 1.28x inference speedup with less than 1% success rate degradation

02. Outperforms pruning and merging baselines with 4-24% less accuracy loss

03. Enhances real-world robotic control responsiveness and throughput

Abstract

Vision-Language-Action (VLA) models enable generalist robotic manipulation but suffer from high inference latency. This bottleneck stems from the massive number of visual tokens processed by large language backbones. Existing methods either prune or merge tokens uniformly, degrading the spatial reasoning essential for robotic control. We present DepthCache, a training-free framework that leverages depth as a structural prior for visual token compression. It partitions observations into depth-based regions and applies spatially differentiated merge ratios, preserving the near-field workspace while compressing the distant background. To exploit temporal redundancy, DepthCache distributes the merging process across consecutive frames, ensuring consistent representations while reducing per-step computation. A motion-adaptive pipeline further optimizes auxiliary view compression based on…
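
The idea of depth-partitioned, spatially differentiated merging described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `merge_by_depth`, the depth thresholds in `edges`, and the per-region ratios in `keep` are all hypothetical, and simple average pooling over contiguous groups stands in for whatever merging rule the paper actually uses. The temporal and motion-adaptive components are not modeled.

```python
from statistics import fmean


def merge_by_depth(tokens, depths, edges=(0.5, 1.5), keep=(1.0, 0.5, 0.25)):
    """Merge visual tokens with a different keep ratio per depth region.

    tokens: list of equal-length float vectors, one per image patch.
    depths: per-patch depth values, same length as tokens.
    edges:  hypothetical thresholds splitting near / mid / far regions.
    keep:   hypothetical fraction of tokens retained in each region.
    """
    if len(tokens) != len(depths):
        raise ValueError("tokens and depths must have the same length")
    if len(keep) != len(edges) + 1:
        raise ValueError("need one keep ratio per depth region")

    # Assign each token index to a region: 0 = nearest, last = farthest.
    regions = [[] for _ in keep]
    for i, d in enumerate(depths):
        regions[sum(d >= e for e in edges)].append(i)

    merged = []  # output is ordered by region, then by original position
    for idxs, ratio in zip(regions, keep):
        if not idxs:
            continue
        # Number of tokens to keep in this region, clamped to [1, len(idxs)].
        k = min(len(idxs), max(1, round(ratio * len(idxs))))
        base, extra = divmod(len(idxs), k)
        start = 0
        for g in range(k):
            size = base + (1 if g < extra else 0)
            group = idxs[start:start + size]
            start += size
            # Average pooling: one merged token per contiguous group.
            dim = len(tokens[0])
            merged.append([fmean(tokens[i][c] for i in group) for c in range(dim)])
    return merged


# Ten toy tokens: two near, four mid-range, four far.
tokens = [[float(i), 1.0] for i in range(10)]
depths = [0.2, 0.3, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0]
merged = merge_by_depth(tokens, depths)
```

With these assumed settings the two near-field tokens pass through unchanged, the four mid-range tokens are averaged into two, and the four far tokens collapse into one, so ten tokens become five. A real system would also need to carry positional information for the merged tokens, which this sketch omits.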

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning