OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer

Haosong Peng; Hao Li; Yalun Dai; Yushi Lan; Yihang Luo; Tianyu Qi; Zhengshen Zhang; Yufeng Zhan; Junfei Zhang; Wenchao Xu; Ziwei Liu

arXiv:2511.10560 · cs.CV · November 17, 2025

TL;DR

OmniVGGT is a framework that feeds auxiliary geometric inputs (depth maps, camera intrinsics, and poses) into a vision transformer for 3D understanding. It improves depth and pose estimation, enhances vision-language-action models, and keeps inference speed comparable to VGGT.

Contribution

The paper introduces a GeoAdapter that encodes geometric cues (depth and camera intrinsics/extrinsics) and a stochastic modality fusion strategy, enabling effective multi-modal learning with minimal overhead.
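As a rough illustration of what a stochastic modality fusion strategy can look like during training, the sketch below draws a random subset of auxiliary modalities for each training instance, so the model sees every combination from RGB-only to all inputs. The sampling rule (an independent keep probability per modality), the modality names, and the helpers `sample_modalities` and `build_inputs` are assumptions made for illustration; the summary here does not specify the paper's actual scheme.

```python
import random

# Hypothetical modality names; RGB is always present, the rest are optional.
AUX_MODALITIES = ("depth", "intrinsics", "extrinsics")


def sample_modalities(rng, keep_prob=0.5):
    """Keep each auxiliary modality independently with probability keep_prob.

    Assumption: independent Bernoulli draws per modality. The result can be
    empty (RGB-only) or contain all auxiliary modalities.
    """
    return {m for m in AUX_MODALITIES if rng.random() < keep_prob}


def build_inputs(sample, active):
    """Assemble the model inputs for one instance from the active subset."""
    inputs = {"rgb": sample["rgb"]}
    for name in active:
        inputs[name] = sample[name]
    return inputs


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder strings stand in for real tensors.
    sample = {"rgb": "img", "depth": "d", "intrinsics": "K", "extrinsics": "T"}
    for _ in range(3):
        active = sample_modalities(rng)
        print(sorted(build_inputs(sample, active)))
```

Sampling per instance rather than per batch means a single model is exposed to many input combinations, which is one way to support an arbitrary number of auxiliary inputs at inference time.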

Findings

01. Outperforms prior methods with auxiliary inputs in depth and pose estimation

02. Achieves state-of-the-art results even with RGB-only inputs

03. Enhances vision-language-action models for robotic tasks

Abstract

General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. In our framework, a GeoAdapter is proposed to encode depth and camera intrinsics/extrinsics into a spatial foundation model. It employs zero-initialized convolutions to progressively inject geometric information without disrupting the foundation model's representation space. This design ensures stable optimization with negligible overhead, maintaining inference speed comparable to VGGT even with multiple additional inputs. Additionally, a stochastic multimodal fusion regimen is…
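The zero-initialized convolution idea mentioned in the abstract can be sketched in a few lines. The example below is a minimal pure-Python illustration, not the paper's GeoAdapter: it assumes a 1x1 convolution over a small feature map and a simple additive injection, and the helper names (`conv1x1`, `zero_conv_params`, `inject`) are hypothetical. Because the weights and bias start at zero, the injected term is exactly zero at the start of training, so the backbone features are unchanged until the adapter learns non-zero weights.

```python
def conv1x1(x, weight, bias):
    """1x1 convolution on x of shape [C_in][H][W]; weight is [C_out][C_in]."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    return [
        [
            [
                bias[o] + sum(weight[o][i] * x[i][r][c] for i in range(c_in))
                for c in range(w)
            ]
            for r in range(h)
        ]
        for o in range(len(weight))
    ]


def zero_conv_params(c_in, c_out):
    """Return all-zero weights and bias for a 1x1 convolution."""
    return [[0.0] * c_in for _ in range(c_out)], [0.0] * c_out


def inject(features, geo, weight, bias):
    """Add the projected geometric map to the backbone features."""
    delta = conv1x1(geo, weight, bias)
    return [
        [[f + d for f, d in zip(f_row, d_row)] for f_row, d_row in zip(f_ch, d_ch)]
        for f_ch, d_ch in zip(features, delta)
    ]


if __name__ == "__main__":
    # Two backbone feature channels and one depth channel on a 2x2 grid.
    features = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
    depth = [[[0.5, 1.5], [2.5, 3.5]]]
    weight, bias = zero_conv_params(c_in=1, c_out=2)
    # With zero parameters the injection is an identity on the features.
    print(inject(features, depth, weight, bias) == features)
```

In a real model the same effect is usually obtained by zeroing the parameters of a framework convolution layer after construction; the list-based version here only makes the arithmetic explicit.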

Code & Models

Models

Livioni/OmniVGGT (Hugging Face model)

Taxonomy

Topics: Advanced Vision and Imaging · Robot Manipulation and Learning · Robotics and Sensor-Based Localization