Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors

Duo Zheng; Shijia Huang; Yanyang Li; Liwei Wang

arXiv:2505.24625 · cs.CV · October 23, 2025

TL;DR

This paper introduces VG LLM, an approach that lets multimodal large language models (MLLMs) understand 3D scenes directly from video by extracting 3D geometry priors from the frames, with no additional 3D data inputs.

Contribution

The paper presents a method for 3D scene understanding from videos that integrates a 3D visual geometry encoder with an MLLM, removing the need for explicit 3D inputs such as point clouds or reconstructed Bird's-Eye View (BEV) maps.

Findings

01. Achieves substantial improvements in 3D scene understanding tasks.

02. Outperforms existing state-of-the-art methods on VSI-Bench evaluations.

03. Demonstrates competitive results with a 4B-parameter model without explicit 3D data.

Abstract

Previous research has investigated the application of Multimodal Large Language Models (MLLMs) in understanding 3D scenes by interpreting them as videos. These approaches generally depend on comprehensive 3D data inputs, such as point clouds or reconstructed Bird's-Eye View (BEV) maps. In our research, we advance this field by enhancing the capability of MLLMs to understand and reason in 3D spaces directly from video data, without the need for additional 3D input. We propose a novel and efficient method called the Video-3D Geometry Large Language Model (VG LLM). Our approach utilizes a 3D visual geometry encoder to extract 3D prior information from video sequences. This information is then integrated with visual tokens and input into the MLLM. Extensive experiments have shown that our method has achieved substantial improvements in various tasks related to 3D scene understanding and…
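
The fusion step described in the abstract can be illustrated with a minimal sketch. It assumes one geometry feature per visual token and assumes the two are combined by a linear projection followed by element-wise addition; the abstract only says the 3D prior information is "integrated with visual tokens", so the fusion rule, the function names (`linear`, `fuse_tokens`), and all dimensions here are hypothetical and not taken from the paper or its repository.

```python
def linear(tokens, weight):
    """Project each token (a row of `tokens`) with `weight` of shape [d_in][d_out]."""
    return [
        [sum(x * w for x, w in zip(row, col)) for col in zip(*weight)]
        for row in tokens
    ]


def fuse_tokens(visual_tokens, geometry_features, proj_weight):
    """Combine per-patch geometry features with visual tokens.

    ASSUMPTION: projection to the visual width followed by element-wise
    addition. The source does not specify how the integration is done.
    """
    if len(visual_tokens) != len(geometry_features):
        raise ValueError("expected one geometry feature per visual token")
    projected = linear(geometry_features, proj_weight)
    return [
        [v + g for v, g in zip(v_row, g_row)]
        for v_row, g_row in zip(visual_tokens, projected)
    ]


# Toy example: 2 tokens, visual width 3, geometry width 2 (all values made up).
visual = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
geometry = [[1.0, 1.0], [2.0, 4.0]]
proj = [[1.0, 0.0, 0.0],   # maps geometry dim 0 to visual dim 0
        [0.0, 1.0, 0.0]]   # maps geometry dim 1 to visual dim 1

fused = fuse_tokens(visual, geometry, proj)
```

In this sketch the fused sequence has the same length and width as the visual tokens, so it could be passed to a language model backbone in place of the original visual tokens without changing the sequence layout.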

Code & Models

Repositories

LaVi-Lab/Video-3D-LLM (PyTorch)

Taxonomy

Topics: Open Education and E-Learning