DepthCache: Depth-Guided Training-Free Visual Token Merging for Vision-Language-Action Model Inference
Yuquan Li, Lianjie Ma, Han Ding, Lijun Zhu

TL;DR
DepthCache is a training-free, depth-guided visual token merging framework that speeds up vision-language-action (VLA) model inference for robotic manipulation by up to 1.28x with less than 1% success rate degradation.
Contribution
DepthCache introduces a depth-based, spatially differentiated token merging method that preserves critical near-field information, exploits temporal redundancy across frames, and applies to diverse VLA models.
Findings
Up to 1.28x inference speedup with less than 1% success rate degradation
Outperforms pruning and merging baselines with 4-24% less accuracy loss
Enhances real-world robotic control responsiveness and throughput
Abstract
Vision-Language-Action (VLA) models enable generalist robotic manipulation but suffer from high inference latency. This bottleneck stems from the massive number of visual tokens processed by large language backbones. Existing methods either prune or merge tokens uniformly, degrading the spatial reasoning essential for robotic control. We present DepthCache, a training-free framework that leverages depth as a structural prior for visual token compression. It partitions observations into depth-based regions and applies spatially differentiated merge ratios, preserving the near-field workspace while compressing the distant background. To exploit temporal redundancy, DepthCache distributes the merging process across consecutive frames, ensuring consistent representations while reducing per-step computation. A motion-adaptive pipeline further optimizes auxiliary view compression based on…
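The depth-partitioned merging idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `depth_guided_merge`, the depth boundaries in `edges`, the per-region `ratios`, and the merge rule (averaging depth-sorted groups of tokens) are all assumptions chosen for illustration. It only shows how near-field tokens can be kept intact while distant tokens are compressed more aggressively.

```python
import numpy as np

def depth_guided_merge(tokens, depth, edges=(1.0, 2.5), ratios=(0.0, 0.5, 0.75)):
    """Hypothetical sketch of depth-differentiated token merging.

    tokens: (N, D) array of visual token embeddings.
    depth:  (N,) array with one depth value per token (units assumed: meters).
    edges:  assumed depth boundaries separating near / mid / far regions.
    ratios: assumed fraction of tokens removed in each region, near to far.
    """
    # Region index per token: 0 = near, 1 = mid, 2 = far.
    region = np.digitize(depth, edges)
    merged = []
    for r, ratio in enumerate(ratios):
        idx = np.flatnonzero(region == r)
        if idx.size == 0:
            continue
        # Number of tokens that survive in this region (at least one).
        keep = max(1, int(round(idx.size * (1.0 - ratio))))
        # Sort by depth so each group averages tokens at similar depth.
        idx = idx[np.argsort(depth[idx], kind="stable")]
        # Stand-in merge rule: mean of each contiguous group.
        for group in np.array_split(idx, keep):
            merged.append(tokens[group].mean(axis=0))
    return np.stack(merged)

# Toy input: 16 near, 16 mid, and 32 far tokens with 8-dim embeddings.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(64, 8))
depth = np.concatenate([np.full(16, 0.5), np.full(16, 1.5), np.full(32, 4.0)])
out = depth_guided_merge(tokens, depth)
print(out.shape)
```

With these assumed settings, the 16 near-field tokens pass through unchanged, the 16 mid-range tokens merge to 8, and the 32 far tokens merge to 8, so 64 tokens shrink to 32. The paper's actual region partitioning, merge ratios, and cross-frame scheduling are not specified in this summary.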
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning
