Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding
Duo Zheng, Shijia Huang, Liwei Wang

TL;DR
This paper introduces Video-3D LLM, a model that treats 3D scenes as dynamic videos and injects 3D position encoding into the video representations, improving spatial understanding and achieving state-of-the-art results on multiple 3D scene understanding benchmarks.
Contribution
The paper proposes a novel Video-3D LLM that incorporates 3D position encoding and a coverage sampling technique to enhance 3D scene understanding in multimodal large language models.
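To make the idea of coverage sampling concrete, here is a minimal sketch of a greedy frame selector. It assumes each frame has already been mapped to the set of voxels it observes; the function name `greedy_coverage_sampling`, the voxel representation, and the tie-breaking rule are all hypothetical choices for illustration, not the authors' implementation.

```python
from typing import Dict, List, Set, Tuple

Voxel = Tuple[int, int, int]  # assumed: integer voxel index per observed 3D cell


def greedy_coverage_sampling(
    frame_voxels: Dict[int, Set[Voxel]], budget: int
) -> List[int]:
    """Pick up to `budget` frames that together cover as many voxels as possible."""
    covered: Set[Voxel] = set()
    chosen: List[int] = []
    remaining = dict(frame_voxels)
    while remaining and len(chosen) < budget:
        # Frame that adds the most not-yet-covered voxels (ties: lowest frame id).
        best = max(sorted(remaining), key=lambda f: len(remaining[f] - covered))
        if not remaining[best] - covered:
            break  # no remaining frame adds new coverage
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


# Toy scene: four frames observing overlapping voxels along one axis.
frames = {
    0: {(0, 0, 0), (1, 0, 0)},
    1: {(1, 0, 0), (2, 0, 0), (3, 0, 0)},
    2: {(0, 0, 0)},
    3: {(4, 0, 0)},
}
selected = greedy_coverage_sampling(frames, budget=3)
print(selected)  # [1, 0, 3]
```

Frame 2 is never selected in this toy scene because its only voxel is already seen by frame 0, which is the behavior a coverage-driven sampler is meant to have: redundant views are skipped in favor of frames that reveal new parts of the scene.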
Findings
Achieves state-of-the-art performance on multiple 3D scene understanding benchmarks, outperforming previous methods
Models spatial relationships in 3D environments more effectively by aligning video representations with real-world spatial context
Abstract
The rapid advancement of Multimodal Large Language Models (MLLMs) has significantly impacted various multimodal tasks. However, these models face challenges in tasks that require spatial understanding within 3D environments. Efforts to enhance MLLMs, such as incorporating point cloud features, have been made, yet a considerable gap remains between the models' learned representations and the inherent complexity of 3D scenes. This discrepancy largely stems from the training of MLLMs on predominantly 2D data, which restricts their effectiveness in comprehending 3D spaces. To address this issue, in this paper, we propose a novel generalist model, i.e., Video-3D LLM, for 3D scene understanding. By treating 3D scenes as dynamic videos and incorporating 3D position encoding into these representations, our Video-3D LLM aligns video representations with real-world spatial contexts more…
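As an illustration of what adding 3D position encoding to a video representation could look like, the sketch below encodes a patch's 3D coordinate with sinusoids and adds it to the patch feature. The sinusoidal form, the base of 10000, the equal split of dimensions across the x, y, and z axes, and the helper names are assumptions made for this example; the abstract does not specify these details.

```python
import math
from typing import List, Sequence


def sinusoidal_3d_encoding(xyz: Sequence[float], dim: int) -> List[float]:
    """Encode a 3D point as a `dim`-length vector; `dim` must be divisible by 6."""
    if dim % 6 != 0:
        raise ValueError("dim must be divisible by 6 (sin/cos pairs per axis)")
    half = dim // 6  # number of frequencies per axis
    freqs = [1.0 / (10000.0 ** (i / half)) for i in range(half)]
    out: List[float] = []
    for coord in xyz:  # x, y, z each get `half` sines followed by `half` cosines
        out.extend(math.sin(coord * f) for f in freqs)
        out.extend(math.cos(coord * f) for f in freqs)
    return out


def add_position(feature: Sequence[float], xyz: Sequence[float]) -> List[float]:
    """Add the 3D encoding element-wise to a patch feature of matching length."""
    pe = sinusoidal_3d_encoding(xyz, len(feature))
    return [f + p for f, p in zip(feature, pe)]


# A zero feature at the origin shows the raw encoding: sin(0)=0, cos(0)=1 per axis.
encoded = add_position([0.0] * 12, (0.0, 0.0, 0.0))
print(encoded)
```

Because the encoding is simply added to the feature, two patches with identical appearance but different 3D locations end up with different representations, which is the property that lets a language model reason about where things are in the scene.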
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Advanced Vision and Imaging
