StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression

Xuanyi Liu; Chunan Yu; Deyi Ji; Qi Zhu; Lingyun Sun; Xuanfu Li; Jin Ma; Tianrun Chen; Lanyun Zhu

arXiv:2604.15237 · cs.CV · April 20, 2026

TL;DR

StreamCacheVGGT is a training-free framework for dense 3D reconstruction from video streams. It improves stability and accuracy under a fixed memory budget by scoring cached tokens across Transformer layers and by compressing the cache with token merging rather than eviction alone.

Contribution

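It proposes two modules, Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC), which improve token importance tracking and cache compression. Without additional training, the resulting framework outperforms existing eviction-based methods.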

Findings

01 Achieves state-of-the-art results on five benchmarks.

02 Demonstrates improved long-term stability in 3D reconstruction.

03 Maintains high accuracy within constant memory constraints.

Abstract

Reconstructing dense 3D geometry from continuous video streams requires stable inference under a constant memory budget. Existing $O(1)$ frameworks primarily rely on a "pure eviction" paradigm, which suffers from significant information destruction due to binary token deletion and evaluation noise from localized, single-layer scoring. To address these bottlenecks, we propose StreamCacheVGGT, a training-free framework that reimagines cache management through two synergistic modules: Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC). CLCES mitigates activation noise by tracking token importance trajectories across the Transformer hierarchy, employing order-statistical analysis to identify sustained geometric salience. Leveraging these robust scores, HCC transcends simple eviction by introducing a three-tier triage strategy that merges moderately…
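
The two ideas in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract is truncated and does not give the scoring statistic, the tier boundaries, or the merge rule. The sketch assumes the median as the order statistic, assumes the three tiers are keep, merge, and evict, and assumes a score-weighted average as the merge. All function names and the parameters keep_frac and merge_frac are hypothetical.

```python
from statistics import median


def cross_layer_score(per_layer_scores):
    """Combine per-layer token scores into one score per token.

    per_layer_scores: list of layers, each a list with one score per token.
    Assumption: the median across layers stands in for the paper's
    order-statistical analysis, so a spike in a single layer is ignored.
    """
    num_tokens = len(per_layer_scores[0])
    return [median(layer[t] for layer in per_layer_scores)
            for t in range(num_tokens)]


def triage(tokens, scores, keep_frac=0.5, merge_frac=0.25):
    """Hypothetical three-tier compression: keep, merge, evict.

    tokens: list of equal-length float vectors; scores: one float per token.
    The top keep_frac of tokens are kept as-is, the next merge_frac are
    averaged into a single token (weighted by score), the rest are dropped.
    """
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    n_keep = int(len(tokens) * keep_frac)
    n_merge = int(len(tokens) * merge_frac)
    keep_ids = order[:n_keep]
    merge_ids = order[n_keep:n_keep + n_merge]

    # Kept tokens stay in their original cache order.
    kept = [tokens[i] for i in sorted(keep_ids)]

    if merge_ids:
        total = sum(scores[i] for i in merge_ids)
        if total > 0:
            weights = [scores[i] / total for i in merge_ids]
        else:
            # Fall back to a plain average when all scores are zero.
            weights = [1.0 / len(merge_ids)] * len(merge_ids)
        dim = len(tokens[0])
        merged = [sum(w * tokens[i][d] for w, i in zip(weights, merge_ids))
                  for d in range(dim)]
        kept.append(merged)
    return kept


# Three layers scoring four tokens; token 1 spikes in one layer only.
layer_scores = [
    [0.9, 0.1, 0.5, 0.0],
    [0.8, 0.9, 0.5, 0.1],
    [0.7, 0.2, 0.5, 0.0],
]
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [5.0, 5.0]]

scores = cross_layer_score(layer_scores)
compressed = triage(tokens, scores, keep_frac=0.25, merge_frac=0.5)
```

In this toy run, token 1 scores highly in only one of the three layers, so its median stays low and it lands in the merge tier rather than the keep tier. The cache shrinks from four tokens to two: the single kept token plus one merged token, with the lowest-scoring token evicted.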
