Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding

Duo Zheng; Shijia Huang; Liwei Wang

arXiv:2412.00493·cs.CV·March 28, 2025

TL;DR

This paper introduces Video-3D LLM, a model that treats 3D scenes as videos with position encoding, significantly improving spatial understanding and achieving state-of-the-art results on multiple 3D scene benchmarks.

Contribution

The paper proposes a novel Video-3D LLM that incorporates 3D position encoding and a coverage sampling technique to enhance 3D scene understanding in multimodal large language models.
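As a rough illustration of what a coverage-based frame sampler might look like, the sketch below greedily picks frames so that each pick adds the most scene regions not yet seen. This is an assumption-laden sketch and not the paper's implementation: the function name `greedy_coverage_sampling`, the representation of each frame as a set of voxel IDs, and the `max_frames` budget are all hypothetical choices made for this example.

```python
from typing import Dict, List, Set


def greedy_coverage_sampling(
    frame_voxels: Dict[int, Set[int]], max_frames: int
) -> List[int]:
    """Greedily select frame IDs that maximize newly covered voxels.

    frame_voxels maps a (hypothetical) frame ID to the set of voxel IDs
    visible in that frame. Selection stops when the budget is used up or
    no remaining frame adds any new voxel.
    """
    covered: Set[int] = set()
    selected: List[int] = []
    remaining = dict(frame_voxels)  # shallow copy so the input is not mutated

    while remaining and len(selected) < max_frames:
        # Frame with the largest coverage gain; ties go to the lowest frame ID.
        best = max(sorted(remaining), key=lambda f: len(remaining[f] - covered))
        if not remaining[best] - covered:
            break  # every remaining frame is already fully covered
        selected.append(best)
        covered |= remaining.pop(best)

    return selected
```

Greedy selection is a common approximation for maximum-coverage problems; whether the paper uses this exact strategy is not stated on this page.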

Findings

01. Achieves state-of-the-art performance on 3D scene benchmarks

02. Effectively models spatial relationships in 3D environments

03. Outperforms previous methods in 3D scene understanding

Abstract

The rapid advancement of Multimodal Large Language Models (MLLMs) has significantly impacted various multimodal tasks. However, these models face challenges in tasks that require spatial understanding within 3D environments. Efforts to enhance MLLMs, such as incorporating point cloud features, have been made, yet a considerable gap remains between the models' learned representations and the inherent complexity of 3D scenes. This discrepancy largely stems from the training of MLLMs on predominantly 2D data, which restricts their effectiveness in comprehending 3D spaces. To address this issue, in this paper, we propose a novel generalist model, i.e., Video-3D LLM, for 3D scene understanding. By treating 3D scenes as dynamic videos and incorporating 3D position encoding into these representations, our Video-3D LLM aligns video representations with real-world spatial contexts more…
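To make the idea of adding 3D position information to video features concrete, here is a minimal sketch that builds a sinusoidal encoding from a patch's 3D coordinate and adds it to that patch's feature vector. The sinusoidal form, the function names `sinusoidal_3d_encoding` and `add_position`, and the base constant 10000 are assumptions borrowed from standard Transformer practice; the abstract does not specify which encoding the authors use.

```python
import math
from typing import List, Sequence


def sinusoidal_3d_encoding(xyz: Sequence[float], dim: int) -> List[float]:
    """Encode one (x, y, z) coordinate as a vector of length dim.

    Each axis gets dim // 3 channels, filled with interleaved sin/cos
    values at geometrically spaced frequencies. dim must be divisible by 6.
    """
    if dim % 6 != 0:
        raise ValueError("dim must be divisible by 6")
    half = (dim // 3) // 2  # number of frequencies per axis
    encoding: List[float] = []
    for value in xyz:
        for i in range(half):
            freq = 1.0 / (10000.0 ** (i / half))
            encoding.append(math.sin(value * freq))
            encoding.append(math.cos(value * freq))
    return encoding


def add_position(features: Sequence[float], xyz: Sequence[float]) -> List[float]:
    """Return the patch features with the 3D position encoding added element-wise."""
    encoding = sinusoidal_3d_encoding(xyz, len(features))
    return [f + p for f, p in zip(features, encoding)]
```

In this sketch the coordinate would come from back-projecting a patch into world space using depth and camera pose, which is one plausible way to tie video frames to real-world positions.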

Code & Models

Repositories

LaVi-Lab/Video-3D-LLM
PyTorch · Official

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Advanced Vision and Imaging