VGGT-Occ: Geometry-Grounded and Density-Aware Gated Fusion for 3D Occupancy Prediction

Xun Chen; Tianchen Deng; Rui Wang; Fangjinhua Wang; Junyi Ma; Hongming Shen; Hesheng Wang; Danwei Wang

arXiv:2605.16911 · cs.CV · May 19, 2026

TL;DR

VGGT-Occ is a geometry-grounded, density-aware fusion framework for 3D occupancy prediction. It aims to improve accuracy by embedding camera geometry throughout the 2D-to-3D lifting pipeline rather than only in the initial projection.

Contribution

The paper proposes Projection-Aware Deformable Attention and a view-quality semantic gate, integrating geometry into all attention stages and improving efficiency with a coarse-to-fine decoder.
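The gated, coarse-to-fine fusion mentioned above can be illustrated with a generic sketch. This is not the paper's decoder: the function names, the scalar gate weights (`w_coarse`, `w_fine`, `bias`), the nearest-neighbour upsampling, and the 1D feature lists are all assumptions chosen to keep the example small. It only shows the common pattern of a sigmoid gate blending an upsampled coarse feature with a fine one.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def upsample_nearest(coarse, factor=2):
    """Repeat each coarse value `factor` times (nearest-neighbour upsampling)."""
    return [v for v in coarse for _ in range(factor)]


def gated_fuse(coarse, fine, w_coarse=1.0, w_fine=1.0, bias=0.0, factor=2):
    """Blend upsampled coarse features with fine features via a sigmoid gate.

    The gate parameters are hypothetical scalars; a real model would learn
    them (typically as a small layer over feature channels).
    """
    up = upsample_nearest(coarse, factor)
    if len(up) != len(fine):
        raise ValueError("fine must have len(coarse) * factor elements")
    fused = []
    for c, f in zip(up, fine):
        g = sigmoid(w_coarse * c + w_fine * f + bias)  # gate in (0, 1)
        fused.append(g * f + (1.0 - g) * c)            # convex combination
    return fused


if __name__ == "__main__":
    print(gated_fuse([0.0, 2.0], [0.0, 0.0, 2.0, 2.0]))
```

Because the output is a convex combination, each fused value always lies between the coarse and fine inputs at that position; the gate decides which of the two dominates.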

Findings

1. Achieves 33.00% IoU on SurroundOcc-nuScenes with 41M parameters.
2. Outperforms existing methods in 3D occupancy prediction accuracy.
3. Reduces decoder cost while maintaining high performance.

Abstract

3D semantic occupancy prediction requires accurate 2D-to-3D feature lifting, yet current methods restrict camera geometry to initial projections. Subsequent operations like offset learning, attention weighting, and cross-camera aggregation remain geometry-agnostic, ignoring essential physical constraints. We propose VGGT-Occ, a framework that embeds geometric tokens throughout the entire pipeline. We introduce Projection-Aware Deformable Attention (PA-DA) to inject geometry into all attention stages. PA-DA projects 3D offsets back to image planes and leverages the projection Jacobian as an additive bias to suppress unreliable observations. Features are then integrated through a view-quality semantic gate for cross-view consistency. To optimize both efficiency and performance, we employ a sequential coarse-to-fine decoder with gated fusion, where low-resolution features are refined into…
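To make the projection-Jacobian idea concrete, here is a minimal sketch under stated assumptions. It uses the standard pinhole camera model, for which the Jacobian of the image projection with respect to a camera-frame point is well known. The bias formula itself (`jacobian_bias`, the reference scale `ref`, and the large negative value for points behind the camera) is a hypothetical choice for illustration, not the PA-DA formulation; the abstract does not specify how the Jacobian is turned into a bias.

```python
import math


def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (X, Y, Z) to pixels (u, v)."""
    X, Y, Z = point
    return (fx * X / Z + cx, fy * Y / Z + cy)


def projection_jacobian(point, fx, fy):
    """2x3 Jacobian d(u, v) / d(X, Y, Z) of the pinhole projection."""
    X, Y, Z = point
    return [[fx / Z, 0.0, -fx * X / (Z * Z)],
            [0.0, fy / Z, -fy * Y / (Z * Z)]]


def offset_in_pixels(point, offset, fx, fy):
    """First-order image-plane displacement caused by a small 3D offset."""
    J = projection_jacobian(point, fx, fy)
    return tuple(sum(J[r][c] * offset[c] for c in range(3)) for r in range(2))


def jacobian_bias(point, fx, fy, ref=100.0, behind=-1e9):
    """Hypothetical additive bias: penalise views where the point projects
    with low pixel sensitivity (far away), and mask views behind the camera."""
    if point[2] <= 1e-6:
        return behind
    J = projection_jacobian(point, fx, fy)
    frob = math.sqrt(sum(v * v for row in J for v in row))
    return min(0.0, math.log(frob / ref))  # never rewards, only suppresses


def biased_softmax(logits, biases):
    """Softmax over per-view attention logits after adding the biases."""
    z = [l + b for l, b in zip(logits, biases)]
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


if __name__ == "__main__":
    views = [(0.0, 0.0, 10.0), (0.0, 0.0, 50.0), (0.0, 0.0, -5.0)]
    biases = [jacobian_bias(p, 1000.0, 1000.0) for p in views]
    print(biased_softmax([0.0, 0.0, 0.0], biases))
```

With equal raw logits, the near view receives the largest weight, the distant view a smaller one, and the view behind the camera is effectively masked out. Adding the bias before the softmax, rather than multiplying the weights afterwards, keeps the result a normalised distribution over views.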
