StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression
Xuanyi Liu, Chunan Yu, Deyi Ji, Qi Zhu, Lingyun Sun, Xuanfu Li, Jin Ma, Tianrun Chen, Lanyun Zhu

TL;DR
StreamCacheVGGT is a training-free framework for dense 3D reconstruction from video streams. It improves stability and accuracy under a fixed memory budget by replacing pure token eviction with cross-layer importance scoring and hybrid cache compression.
Contribution
It proposes two modules, Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC), which improve token importance tracking and cache compression respectively, and which surpass existing eviction-based methods without additional training.
Findings
Achieves state-of-the-art results on five benchmarks.
Demonstrates improved long-term stability in 3D reconstruction.
Maintains high accuracy within constant memory constraints.
Abstract
Reconstructing dense 3D geometry from continuous video streams requires stable inference under a constant memory budget. Existing frameworks primarily rely on a "pure eviction" paradigm, which suffers from significant information destruction due to binary token deletion and evaluation noise from localized, single-layer scoring. To address these bottlenecks, we propose StreamCacheVGGT, a training-free framework that reimagines cache management through two synergistic modules: Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC). CLCES mitigates activation noise by tracking token importance trajectories across the Transformer hierarchy, employing order-statistical analysis to identify sustained geometric salience. Leveraging these robust scores, HCC transcends simple eviction by introducing a three-tier triage strategy that merges moderately…
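The two ideas in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the abstract is truncated and gives no formulas, so every name and rule below is an assumption made for illustration. `cross_layer_score` and `triage` are hypothetical functions. The choice of the k-th smallest per-layer score as the order statistic is assumed, as is the tier layout (keep the top tokens, merge the next tier into one score-weighted average, evict the rest).

```python
def cross_layer_score(per_layer_scores, k=1):
    """Aggregate per-layer importance into one robust score per token.

    per_layer_scores: list of layers, each a list with one score per token.
    Returns the k-th smallest score (0-indexed) across layers for each token,
    so a token that spikes in a single layer does not look important.
    The specific order statistic is an assumption, not taken from the paper.
    """
    num_tokens = len(per_layer_scores[0])
    robust = []
    for t in range(num_tokens):
        across_layers = sorted(layer[t] for layer in per_layer_scores)
        robust.append(across_layers[k])
    return robust


def triage(tokens, scores, keep_n, merge_n):
    """Hypothetical three-tier cache compression.

    Tier 1: the keep_n highest-scoring tokens are kept unchanged.
    Tier 2: the next merge_n tokens are merged into one score-weighted mean.
    Tier 3: all remaining tokens are evicted.
    """
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep_idx = sorted(order[:keep_n])          # preserve original token order
    merge_idx = order[keep_n:keep_n + merge_n]

    cache = [tokens[i] for i in keep_idx]
    if merge_idx:
        total = sum(scores[i] for i in merge_idx)
        if total > 0:
            weights = [scores[i] / total for i in merge_idx]
        else:
            # Fall back to a plain mean when all merge-tier scores are zero.
            weights = [1.0 / len(merge_idx)] * len(merge_idx)
        dim = len(tokens[0])
        merged = [
            sum(w * tokens[i][d] for w, i in zip(weights, merge_idx))
            for d in range(dim)
        ]
        cache.append(merged)
    return cache


# Toy data: 3 layers, 4 tokens, 2-dimensional token features.
per_layer = [
    [0.9, 0.1, 0.5, 0.2],
    [0.8, 0.9, 0.5, 0.1],  # token 1 spikes in this layer only
    [0.7, 0.1, 0.4, 0.1],
]
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]

scores = cross_layer_score(per_layer, k=1)
cache = triage(tokens, scores, keep_n=1, merge_n=2)
print(scores)
print(len(cache))
```

In this toy input, token 1 scores highly in only one layer, so the order statistic gives it a low aggregate score and it is not placed in the kept tier. The compressed cache holds `keep_n + 1` entries whenever the merge tier is non-empty, which is how a sketch like this keeps memory constant.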
