3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding
Haomiao Xiong, Yunzhi Zhuge, Jiawen Zhu, Lu Zhang, Huchuan Lu

TL;DR
This paper introduces 3UR-LLM, an end-to-end multimodal large language model that effectively interprets 3D scenes by leveraging high-quality pre-training data and novel architecture components, advancing 3D scene understanding capabilities.
Contribution
The paper presents 3UR-LLM, a novel 3D multimodal LLM that directly processes point clouds and introduces a 3D compressor, improving performance and efficiency over previous models.
Findings
Exceeds previous SOTA by 7.1% CIDEr on ScanQA
Uses fewer training resources than prior models
Constructs 3DS-160K, a new dataset of 3D-text pairs used for pre-training
Abstract
Multi-modal Large Language Models (MLLMs) exhibit impressive capabilities in 2D tasks, yet encounter challenges in discerning the spatial positions, interrelations, and causal logic in scenes when transitioning from 2D to 3D representations. We find that the limitations mainly lie in: i) the high annotation cost restricting the scale-up of volumes of 3D scene data, and ii) the lack of a straightforward and effective way to perceive 3D information which results in prolonged training durations and complicates the streamlined framework. To this end, we develop a pipeline based on open-source 2D MLLMs and LLMs to generate high-quality 3D-text pairs and construct 3DS-160K, to enhance the pre-training process. Leveraging this high-quality pre-training data, we introduce the 3UR-LLM model, an end-to-end 3D MLLM designed for precise interpretation of 3D scenes, showcasing exceptional capability…
Taxonomy
Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Human Motion and Animation
Methods: Sparse Evolutionary Training
