FrameVGGT: Geometry-Aligned Frame-Level Memory for Bounded Streaming VGGT

Zhisong Xu; Takeshi Oishi

arXiv:2603.07690 · cs.CV · April 21, 2026

TL;DR

FrameVGGT is a bounded-memory framework for streaming 3D perception. It keeps each frame's incremental KV contribution together as one coherent frame-level segment, with the goal of stable long-horizon inference under a fixed memory budget.

Contribution

FrameVGGT proposes a frame-level memory organization in which each segment is summarized by a lightweight key-space prototype, with optional sparse anchors, to support long-horizon inference under a fixed memory budget.

Findings

01. Achieves better accuracy-memory trade-offs in 3D reconstruction, depth, and pose estimation.

02. Maintains more stable geometric reasoning over long streaming sequences.

03. Outperforms unbounded or token-level memory approaches in long-term tasks.

Abstract

Streaming Visual Geometry Transformers such as StreamVGGT enable strong online 3D perception, but their KV-cache grows unbounded over long streams, limiting practical deployment. We revisit bounded-memory streaming from the perspective of geometric support. Unlike language modeling, where useful information can often be compressed at the token level, geometry-driven reasoning depends on redundant and mutually compatible multi-view support. Under fixed budgets, token-level retention can fragment within-frame evidence, weaken the coherence of geometric support, and make stable long-horizon inference more difficult. Motivated by this observation, we propose FrameVGGT, a bounded explicit-memory framework that organizes each frame's incremental KV contribution as a coherent frame-level segment. FrameVGGT summarizes each segment with a lightweight key-space prototype and maintains a…
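The idea of frame-level retention under a fixed budget can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract is truncated before it states what the memory maintains, so the names `FrameSegment` and `FrameMemory`, the mean-of-keys prototype, and the "evict the most redundant segment" policy are all assumptions made for illustration. The only points taken from the abstract are that each frame's KV contribution is kept or dropped as a whole, and that each segment is summarized by a prototype in key space.

```python
import math
from collections import namedtuple

# Hypothetical container: one frame's keys and values stay together as a unit.
FrameSegment = namedtuple("FrameSegment", ["frame_id", "keys", "values", "prototype"])


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class FrameMemory:
    """Bounded memory that retains or evicts whole frames, never single tokens."""

    def __init__(self, max_frames):
        if max_frames < 1:
            raise ValueError("max_frames must be at least 1")
        self.max_frames = max_frames
        self.segments = []  # ordered oldest to newest

    def add_frame(self, frame_id, keys, values):
        # Assumed summary: the prototype is the mean of the frame's key vectors.
        segment = FrameSegment(frame_id, keys, values, mean_vector(keys))
        self.segments.append(segment)
        if len(self.segments) > self.max_frames:
            self._evict_most_redundant()

    def _evict_most_redundant(self):
        # Assumed policy: a segment is redundant if its prototype is close to
        # some other segment's prototype. Ties go to the oldest segment,
        # because max() returns the first maximal index it encounters.
        def redundancy(i):
            return max(
                cosine(self.segments[i].prototype, self.segments[j].prototype)
                for j in range(len(self.segments))
                if j != i
            )

        victim = max(range(len(self.segments)), key=redundancy)
        del self.segments[victim]
```

With a budget of two frames, adding frames whose keys are `[[1.0, 0.0], [1.0, 0.0]]`, `[[0.0, 1.0]]`, and `[[1.0, 0.1]]` in that order leaves frames 1 and 2 in memory: the first and third prototypes are nearly parallel, so the older of that pair is dropped in full. A token-level policy would instead be free to keep fragments of all three frames, which is the fragmentation of within-frame evidence the abstract warns about.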
