3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding

Haomiao Xiong; Yunzhi Zhuge; Jiawen Zhu; Lu Zhang; Huchuan Lu

arXiv:2501.07819 · cs.CV · January 15, 2025

TL;DR

This paper introduces 3UR-LLM, an end-to-end multimodal large language model that effectively interprets 3D scenes by leveraging high-quality pre-training data and novel architecture components, advancing 3D scene understanding capabilities.

Contribution

The paper presents 3UR-LLM, a novel 3D multimodal LLM that directly processes point clouds and introduces a 3D compressor, improving performance and efficiency over previous models.

Findings

01 Exceeds the previous SOTA by 7.1% CIDEr on ScanQA

02 Uses fewer training resources than prior models

03 Constructs 3DS-160K, a new dataset of 3D-text pairs for pre-training

Abstract

Multi-modal Large Language Models (MLLMs) exhibit impressive capabilities in 2D tasks, yet encounter challenges in discerning the spatial positions, interrelations, and causal logic in scenes when transitioning from 2D to 3D representations. We find that the limitations mainly lie in: i) the high annotation cost, which restricts scaling up the volume of 3D scene data, and ii) the lack of a straightforward and effective way to perceive 3D information, which prolongs training and complicates the framework. To this end, we develop a pipeline based on open-source 2D MLLMs and LLMs to generate high-quality 3D-text pairs and construct 3DS-160K to enhance the pre-training process. Leveraging this high-quality pre-training data, we introduce the 3UR-LLM model, an end-to-end 3D MLLM designed for precise interpretation of 3D scenes, showcasing exceptional capability…

Code & Models

Repositories

hmxiong/3ur-llm (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Human Motion and Animation

Methods: Sparse Evolutionary Training