Analyzing the Mechanism of Attention Collapse in VGGT from a Dynamics Perspective
Huan Li, Longjun Luo, Yuling Shi, Xiaodong Gu

TL;DR
This paper provides a mathematical analysis of attention collapse in VGGT, revealing how token features converge to a degenerate state and proposing insights for improving scalable 3D-vision transformers.
Contribution
It introduces a mean-field PDE model that predicts attention collapse dynamics and explains the effectiveness of token-merging remedies in VGGT.
Findings
Attention matrices become near rank-one once the input sequence exceeds a few hundred frames.
Token features converge to a Dirac-type measure at a rate of O(1/L).
Token-merging delays collapse by reducing the diffusion coefficient.
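The first two findings can be illustrated with a toy simulation. The sketch below is not VGGT and not the paper's model: it assumes a single head, identity query/key/value projections, no residual connections or MLPs, and random Gaussian tokens. The function names and all parameter values are hypothetical. It iterates plain softmax self-attention and tracks two quantities per layer: the ratio of the second to the first singular value of the attention matrix (near zero means near rank-one) and the coordinate-wise spread of the token features (near zero means the tokens have concentrated at a single point).

```python
import numpy as np


def softmax_rows(z):
    """Numerically stable softmax applied to each row."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def simulate(n_tokens=64, dim=16, n_layers=32, seed=0):
    """Iterate x <- softmax(x x^T / sqrt(dim)) x and record collapse metrics.

    Toy assumptions (not from the paper): identity projections, one head,
    no residual stream, Gaussian initial tokens.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_tokens, dim))
    spread, sv_ratio = [], []
    for _ in range(n_layers):
        attn = softmax_rows(x @ x.T / np.sqrt(dim))
        s = np.linalg.svd(attn, compute_uv=False)
        # sigma_2 / sigma_1 is 0 for an exactly rank-one attention matrix.
        sv_ratio.append(float(s[1] / s[0]))
        # Sum over feature coordinates of (max - min) across tokens.
        spread.append(float((x.max(axis=0) - x.min(axis=0)).sum()))
        # Each new token is a convex combination of the old tokens.
        x = attn @ x
    return np.array(spread), np.array(sv_ratio)


spread, sv_ratio = simulate()
for layer in (0, 1, 2, 4, 8, 16, 31):
    print(f"layer {layer:2d}  sigma2/sigma1 = {sv_ratio[layer]:.3e}  "
          f"spread = {spread[layer]:.3e}")
```

Because every attention row is a strictly positive probability vector, each update replaces a token with a convex combination of all tokens, so the spread can only shrink. This toy contracts much faster than the O(1/L) rate stated above, which is expected given how much of the real architecture it omits; it shows the direction of the effect, not its rate.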
Abstract
Visual Geometry Grounded Transformer (VGGT) delivers state-of-the-art feed-forward 3D reconstruction, yet its global self-attention layer suffers from a drastic collapse phenomenon when the input sequence exceeds a few hundred frames: attention matrices rapidly become near rank-one, token geometry degenerates to an almost one-dimensional subspace, and reconstruction error accumulates super-linearly. In this report, we establish a rigorous mathematical explanation of the collapse by viewing the global-attention iteration as a degenerate diffusion process. We prove that, in VGGT, the token-feature flow converges toward a Dirac-type measure at a rate of O(1/L), where L is the layer index, yielding a closed-form mean-field partial differential equation that precisely predicts the empirically observed rank profile. The theory quantitatively matches the attention-heat-map evolution and a series…
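One possible way to read the rate claim is sketched below. This is an illustration and not the paper's statement: the excerpt does not name the metric, so the 2-Wasserstein distance $W_2$, the limit point $x^\ast$, the token count $N$, and the constant $C$ are all assumptions. Here $\mu_L$ denotes the empirical measure of the token features $x_i^{(L)}$ at layer $L$.

```latex
\mu_L = \frac{1}{N}\sum_{i=1}^{N} \delta_{x_i^{(L)}},
\qquad
W_2\bigl(\mu_L, \delta_{x^\ast}\bigr)
  = \Bigl(\frac{1}{N}\sum_{i=1}^{N} \bigl\| x_i^{(L)} - x^\ast \bigr\|^2\Bigr)^{1/2}
  \le \frac{C}{L}.
```

The middle equality holds because the only transport plan onto a Dirac mass sends every token to $x^\ast$, so under this reading the claim says that the root-mean-square distance of the tokens from a common point shrinks at least as fast as $1/L$.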
Taxonomy
Topics: Advanced Optical Imaging Technologies · Advanced Memory and Neural Computing · Advanced Vision and Imaging
