Analyzing the Mechanism of Attention Collapse in VGGT from a Dynamics Perspective

Huan Li; Longjun Luo; Yuling Shi; Xiaodong Gu

arXiv:2512.21691 · cs.CV · December 29, 2025

TL;DR

This paper provides a mathematical analysis of attention collapse in VGGT, showing how token features converge to a degenerate state and offering insights for improving scalable 3D-vision transformers.

Contribution

It introduces a mean-field PDE model that predicts attention collapse dynamics and explains the effectiveness of token-merging remedies in VGGT.

Findings

01

Attention matrices become near rank-one with many frames.

02

Token features converge to a Dirac-type measure at a rate of O(1/L), where L is the layer index.

03

Token-merging delays collapse by reducing the diffusion coefficient.
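
The first two findings can be illustrated with a toy experiment. The sketch below is not the VGGT architecture and not the paper's mean-field model: it assumes a single parameter-free attention head with no learned projections, no residual connection, and no normalization, applied repeatedly to random tokens. All function names and parameters are hypothetical. Under these assumptions each update replaces every token by a convex combination of the previous tokens, so the token cloud can only shrink, and once the tokens coincide every attention row is the same, which makes the attention matrix rank-one. The toy is not expected to reproduce the O(1/L) rate stated above.

```python
import numpy as np


def softmax_rows(scores):
    """Numerically stable row-wise softmax."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)


def attention_step(tokens):
    """One toy update: X <- softmax(X X^T / sqrt(d)) X (assumed, parameter-free)."""
    dim = tokens.shape[1]
    attn = softmax_rows(tokens @ tokens.T / np.sqrt(dim))
    return attn @ tokens, attn


def diameter(tokens):
    """Largest pairwise Euclidean distance between tokens."""
    diffs = tokens[:, None, :] - tokens[None, :, :]
    return float(np.linalg.norm(diffs, axis=-1).max())


def stable_rank(matrix):
    """Squared Frobenius norm over squared spectral norm; equals 1 for rank-one matrices."""
    return float(np.linalg.norm(matrix, "fro") ** 2 / np.linalg.norm(matrix, 2) ** 2)


def simulate_collapse(num_tokens=64, dim=16, num_layers=24, seed=0):
    """Track token-cloud diameter and attention stable rank across layers."""
    rng = np.random.default_rng(seed)
    tokens = rng.normal(size=(num_tokens, dim))
    diameters = [diameter(tokens)]
    stable_ranks = []
    for _ in range(num_layers):
        tokens, attn = attention_step(tokens)
        diameters.append(diameter(tokens))
        stable_ranks.append(stable_rank(attn))
    return diameters, stable_ranks


if __name__ == "__main__":
    diameters, stable_ranks = simulate_collapse()
    for layer, (diam, rank) in enumerate(zip(diameters[1:], stable_ranks), start=1):
        print(f"layer {layer:2d}  diameter {diam:.3e}  stable rank {rank:.4f}")
```

Stable rank is used here instead of exact matrix rank because it varies smoothly as the attention rows become similar, which makes the approach to rank-one visible layer by layer. The diameter is tracked because it cannot increase under convex averaging, so it gives a simple monotone measure of how far the tokens are from a single point.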

Abstract

Visual Geometry Grounded Transformer (VGGT) delivers state-of-the-art feed-forward 3D reconstruction, yet its global self-attention layer suffers from a drastic collapse phenomenon when the input sequence exceeds a few hundred frames: attention matrices rapidly become near rank-one, token geometry degenerates to an almost one-dimensional subspace, and reconstruction error accumulates super-linearly. In this report, we establish a rigorous mathematical explanation of the collapse by viewing the global-attention iteration as a degenerate diffusion process. We prove that, in VGGT, the token-feature flow converges toward a Dirac-type measure at an $O(1/L)$ rate, where $L$ is the layer index, yielding a closed-form mean-field partial differential equation that precisely predicts the empirically observed rank profile. The theory quantitatively matches the attention-heat-map evolution and a series…

Taxonomy

Topics: Advanced Optical Imaging Technologies · Advanced Memory and Neural Computing · Advanced Vision and Imaging