TL;DR
OVGGT is a training-free framework for 3D reconstruction from streaming video. It keeps memory and compute within a fixed budget regardless of sequence length while maintaining high reconstruction accuracy.
Contribution
OVGGT combines Self-Selective Caching, which compresses the KV cache using FFN residual magnitudes, with Dynamic Anchor Protection to enable long-horizon streaming inference at constant resource usage.
Findings
Processes arbitrarily long videos within fixed VRAM.
Achieves state-of-the-art 3D geometric accuracy.
Supports indoor, outdoor, and ultra-long sequences.
Abstract
Reconstructing 3D geometry from streaming video requires continuous inference under bounded resources. Recent geometric foundation models achieve impressive reconstruction quality through all-to-all attention, yet their quadratic cost confines them to short, offline sequences. Causal-attention variants such as StreamVGGT enable single-pass streaming but accumulate an ever-growing KV cache, exhausting GPU memory within hundreds of frames and precluding the long-horizon deployment that motivates streaming inference in the first place. We present OVGGT, a training-free framework that bounds both memory and compute to a fixed budget regardless of sequence length. Our approach combines Self-Selective Caching, which leverages FFN residual magnitudes to compress the KV cache while remaining fully compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical…
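The idea of a fixed-budget cache with protected entries can be illustrated with a toy sketch. This is not the authors' implementation: the `BoundedCache` class, the `run_stream` helper, the `budget` and `num_anchors` parameters, and the deterministic toy score are all hypothetical stand-ins. In particular, the score here merely plays the role that an importance signal (such as the FFN residual magnitude mentioned in the abstract) might play, and "protected" simply marks the first few tokens as anchors that are never evicted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entry:
    token_id: int
    score: float      # stand-in importance score (hypothetical, not the paper's metric)
    protected: bool   # True for anchor tokens that must never be evicted


@dataclass
class BoundedCache:
    budget: int  # maximum number of cached tokens (hypothetical parameter)
    entries: List[Entry] = field(default_factory=list)

    def add(self, token_id: int, score: float, protected: bool = False) -> None:
        """Insert a token, then evict low-score unprotected tokens until within budget."""
        self.entries.append(Entry(token_id, score, protected))
        while len(self.entries) > self.budget:
            evictable = [e for e in self.entries if not e.protected]
            if not evictable:
                raise ValueError("budget is smaller than the number of protected tokens")
            victim = min(evictable, key=lambda e: e.score)
            self.entries.remove(victim)


def run_stream(num_tokens: int, budget: int, num_anchors: int) -> BoundedCache:
    """Feed a stream of tokens through the cache; the first num_anchors are protected."""
    cache = BoundedCache(budget)
    for t in range(num_tokens):
        score = ((t * 37) % 101) / 101.0  # deterministic toy score, for illustration only
        cache.add(t, score, protected=t < num_anchors)
    return cache


cache = run_stream(num_tokens=1000, budget=8, num_anchors=2)
print(len(cache.entries), sorted(e.token_id for e in cache.entries))
```

However long the stream, the cache never holds more than `budget` entries, and the protected anchor tokens survive every eviction. A real system would evict key/value tensors per layer and head rather than Python objects, but the bookkeeping follows the same pattern.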
