FrameVGGT: Geometry-Aligned Frame-Level Memory for Bounded Streaming VGGT
Zhisong Xu, Takeshi Oishi

TL;DR
FrameVGGT introduces a geometry-aware, bounded-memory framework for streaming 3D perception tasks, organizing frame contributions as coherent segments to improve long-term stability and efficiency.
Contribution
FrameVGGT proposes a frame-level memory organization in which each frame's KV contribution is kept as a coherent segment, summarized compactly, and optionally complemented by sparse anchors, to support long-horizon inference under a fixed memory budget.
Findings
Achieves better accuracy-memory trade-offs in 3D reconstruction, depth, and pose estimation.
Maintains more stable geometric reasoning over long streaming sequences.
Outperforms unbounded or token-level memory approaches in long-term tasks.
Abstract
Streaming Visual Geometry Transformers such as StreamVGGT enable strong online 3D perception, but their KV-cache grows unbounded over long streams, limiting practical deployment. We revisit bounded-memory streaming from the perspective of geometric support. Unlike language modeling, where useful information can often be compressed at the token level, geometry-driven reasoning depends on redundant and mutually compatible multi-view support. Under fixed budgets, token-level retention can fragment within-frame evidence, weaken the coherence of geometric support, and make stable long-horizon inference more difficult. Motivated by this observation, we propose FrameVGGT, a bounded explicit-memory framework that organizes each frame's incremental KV contribution as a coherent frame-level segment. FrameVGGT summarizes each segment with a lightweight key-space prototype and maintains a…
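To make the frame-level idea concrete, the following is a minimal sketch of a bounded memory that stores each frame's keys and values as one indivisible segment and summarizes it with a key-space prototype. It is an illustration under stated assumptions, not the paper's implementation: the class and function names are hypothetical, the prototype is assumed to be the mean of the segment's keys, and the eviction rule (drop the segment whose prototype is most similar to another retained prototype) is a placeholder, since the abstract above is truncated before it describes how the memory is maintained.

```python
from dataclasses import dataclass, field
import math

Vector = list[float]


def mean_vector(rows: list[Vector]) -> Vector:
    """Column-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0.0 and nb > 0.0 else 0.0


@dataclass
class FrameSegment:
    """All KV entries contributed by one frame, kept or dropped as a unit."""
    frame_id: int
    keys: list[Vector]     # one key vector per token of this frame
    values: list[Vector]   # one value vector per token of this frame
    prototype: Vector = field(init=False)

    def __post_init__(self) -> None:
        # ASSUMPTION: the key-space prototype is the mean key of the segment.
        self.prototype = mean_vector(self.keys)


class BoundedFrameMemory:
    """Holds at most `max_frames` whole segments (hypothetical helper)."""

    def __init__(self, max_frames: int) -> None:
        if max_frames < 1:
            raise ValueError("max_frames must be at least 1")
        self.max_frames = max_frames
        self.segments: list[FrameSegment] = []

    def add(self, segment: FrameSegment) -> None:
        self.segments.append(segment)
        if len(self.segments) > self.max_frames:
            self._evict_one()

    def _evict_one(self) -> None:
        # ASSUMPTION: placeholder policy, not taken from the paper.
        # Score each segment by its highest prototype similarity to any
        # other segment and drop the highest scorer (ties: oldest first).
        def redundancy(i: int) -> float:
            return max(
                cosine(self.segments[i].prototype, other.prototype)
                for j, other in enumerate(self.segments) if j != i
            )

        victim = max(range(len(self.segments)), key=redundancy)
        del self.segments[victim]

    def kv(self) -> tuple[list[Vector], list[Vector]]:
        """Concatenate retained segments into flat key and value lists."""
        keys = [k for s in self.segments for k in s.keys]
        values = [v for s in self.segments for v in s.values]
        return keys, values
```

The point of the sketch is the granularity: eviction removes or keeps an entire frame, so every retained frame still has all of its tokens, in contrast to token-level retention, which the abstract argues can fragment within-frame evidence. The memory size is bounded by `max_frames` times the number of tokens per frame, regardless of stream length.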
