Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors
Duo Zheng, Shijia Huang, Yanyang Li, Liwei Wang

TL;DR
This paper introduces VG LLM, a novel approach that enhances multimodal large language models to understand 3D scenes directly from videos by extracting 3D geometry priors without additional 3D data inputs.
Contribution
The paper presents a new method for 3D scene understanding from videos using a 3D visual geometry encoder integrated with MLLMs, eliminating the need for explicit 3D data.
Findings
Achieves substantial improvements in 3D scene understanding tasks.
Outperforms existing state-of-the-art methods on VSI-Bench evaluations.
Demonstrates competitive results with a 4B-parameter model that uses no explicit 3D data.
Abstract
Previous research has investigated the application of Multimodal Large Language Models (MLLMs) in understanding 3D scenes by interpreting them as videos. These approaches generally depend on comprehensive 3D data inputs, such as point clouds or reconstructed Bird's-Eye View (BEV) maps. In our research, we advance this field by enhancing the capability of MLLMs to understand and reason in 3D spaces directly from video data, without the need for additional 3D input. We propose a novel and efficient method called the Video-3D Geometry Large Language Model (VG LLM). Our approach utilizes a 3D visual geometry encoder to extract 3D prior information from video sequences. This information is then integrated with visual tokens and input into the MLLM. Extensive experiments have shown that our method has achieved substantial improvements in various tasks related to 3D scene understanding and…
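The pipeline described in the abstract (geometry encoder output combined with visual tokens before the MLLM) can be illustrated with a minimal sketch. The abstract does not say how the two streams are combined, so the element-wise addition after a linear projection used here is an assumption for illustration only. The names `project` and `fuse_tokens` and the toy numbers are hypothetical and are not taken from the paper or its code.

```python
from typing import List

Matrix = List[List[float]]


def project(features: Matrix, weight: Matrix) -> Matrix:
    """Linearly map each feature vector (length d_in) with a d_in x d_out weight."""
    d_out = len(weight[0])
    return [
        [sum(f[i] * weight[i][j] for i in range(len(f))) for j in range(d_out)]
        for f in features
    ]


def fuse_tokens(visual_tokens: Matrix, geometry_features: Matrix, weight: Matrix) -> Matrix:
    """Add projected geometry features to visual tokens, one per token position.

    ASSUMPTION: fusion by element-wise addition; the source only says the
    3D prior information is "integrated with visual tokens".
    """
    projected = project(geometry_features, weight)
    if len(projected) != len(visual_tokens):
        raise ValueError("expected one geometry feature per visual token")
    return [
        [v + g for v, g in zip(v_tok, g_tok)]
        for v_tok, g_tok in zip(visual_tokens, projected)
    ]


# Toy example: 2 token positions, visual dim 2, geometry dim 3.
visual_tokens = [[1.0, 2.0], [3.0, 4.0]]
geometry_features = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # hypothetical 3 x 2 projection

fused = fuse_tokens(visual_tokens, geometry_features, weight)
print(fused)  # [[2.5, 2.5], [3.0, 6.0]]
```

In this sketch the fused sequence keeps the same length and width as the visual tokens, so it could be passed to a language model in place of the original visual tokens without changing the input shape.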
Taxonomy
Topics: Open Education and E-Learning
