VG3T: Visual Geometry Grounded Gaussian Transformer

Junho Kim; Seongwon Lee

arXiv:2512.05988 · cs.CV · December 9, 2025

TL;DR

VG3T is a multi-view Gaussian transformer that predicts 3D semantic occupancy by directly predicting semantically attributed Gaussians jointly across views, improving coherence and efficiency over prior view-by-view methods.

Contribution

The paper presents a multi-view Gaussian prediction framework that avoids the fragmentation of view-by-view processing, and adds two components, Grid-Based Sampling and Positional Refinement, to mitigate the distance-dependent density bias of pixel-aligned Gaussians.
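To make the density bias concrete, here is a minimal one-dimensional sketch. It is not the paper's Grid-Based Sampling, whose details are not given on this page; it only illustrates the general idea that pixel-aligned points crowd near the camera, and that keeping one representative per grid cell evens out the density. The camera setup and all names (`unproject_ground_points`, `grid_sample`, `count_per_cell`, `cam_height`, `cell_size`) are hypothetical.

```python
import math
from collections import defaultdict


def unproject_ground_points(num_rays=200, cam_height=1.5, max_range=50.0):
    """Distances at which evenly spaced downward-looking rays hit flat ground.

    Hypothetical setup: one ray per image row, pitched evenly between the
    horizon and straight down, camera mounted cam_height metres above ground.
    """
    points = []
    for i in range(1, num_rays + 1):
        pitch = (math.pi / 2) * i / (num_rays + 1)  # angle below the horizon
        dist = cam_height / math.tan(pitch)          # ground hit distance
        if dist <= max_range:
            points.append(dist)
    return points


def count_per_cell(points, cell_size=5.0):
    """Number of points falling into each occupied grid cell."""
    counts = defaultdict(int)
    for p in points:
        counts[int(p // cell_size)] += 1
    return dict(counts)


def grid_sample(points, cell_size=5.0):
    """Keep one representative (the mean position) per occupied grid cell."""
    cells = defaultdict(list)
    for p in points:
        cells[int(p // cell_size)].append(p)
    return [sum(v) / len(v) for _, v in sorted(cells.items())]


pixel_aligned = unproject_ground_points()
sampled = grid_sample(pixel_aligned)

before = count_per_cell(pixel_aligned)  # heavily skewed toward near cells
after = count_per_cell(sampled)         # one point per occupied cell

print("points per cell before:", dict(sorted(before.items())))
print("points per cell after: ", dict(sorted(after.items())))
```

In this toy setup the nearest cell collects most of the points while distant cells hold one or two, which is the distance-dependent skew; after grid sampling every occupied cell holds exactly one point.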

Findings

01. Achieves 1.7% higher mIoU than the previous state of the art on the nuScenes benchmark.

02. Uses 46% fewer primitives than the previous state of the art.

03. Produces a more coherent and more efficient 3D scene representation than view-by-view methods.
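For readers unfamiliar with the metric, mIoU (mean intersection over union) averages the per-class overlap between predicted and ground-truth labels. The sketch below is a generic illustration on flat label lists, not the benchmark's evaluation code; the function name `mean_iou` and the choice to skip classes absent from both prediction and target are assumptions, and real benchmarks may handle ignored labels differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the target."""
    ious = []
    for c in range(num_classes):
        # Positions where both agree on class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Positions where either one says class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: skip classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes over six voxels.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))  # per-class IoU: 1/3, 2/3, 1/2
```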

Abstract

Generating a coherent 3D scene representation from multi-view images is a fundamental yet challenging task. Existing methods often struggle with multi-view fusion, leading to fragmented 3D representations and sub-optimal performance. To address this, we introduce VG3T, a novel multi-view feed-forward network that predicts a 3D semantic occupancy via a 3D Gaussian representation. Unlike prior methods that infer Gaussians from single-view images, our model directly predicts a set of semantically attributed Gaussians in a joint, multi-view fashion. This novel approach overcomes the fragmentation and inconsistency inherent in view-by-view processing, offering a unified paradigm to represent both geometry and semantics. We also introduce two key components, Grid-Based Sampling and Positional Refinement, to mitigate the distance-dependent density bias common in pixel-aligned Gaussian…

Taxonomy

Topics: Advanced Vision and Imaging · 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization