3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer

Jiajun Deng, Tianyu He, Li Jiang, Tianyu Wang, Feras Dayoub, and Ian Reid

arXiv:2501.01163·cs.CV·April 25, 2025

TL;DR

3D-LLaVA is a minimalist 3D large multimodal model that takes only point clouds as input. Its core component, the Omni Superpoint Transformer (OST), targets fine-grained scene understanding and flexible human-agent interaction.

Contribution

The paper presents the Omni Superpoint Transformer (OST), an architecture that integrates several functions into one module, simplifying 3D LMMs and improving their ability to understand and interact with 3D environments.

Findings

01. Achieves state-of-the-art results on multiple benchmarks.

02. Demonstrates effective 3D scene understanding and reasoning.

03. Simplifies the 3D LMM architecture with an integrated design.

Abstract

Current 3D Large Multimodal Models (3D LMMs) have shown tremendous potential in 3D-vision-based dialogue and reasoning. However, how to further enhance 3D LMMs to achieve fine-grained scene understanding and facilitate flexible human-agent interaction remains a challenging problem. In this work, we introduce 3D-LLaVA, a simple yet highly powerful 3D LMM designed to act as an intelligent assistant in comprehending, reasoning, and interacting with the 3D world. Unlike existing top-performing methods that rely on complicated pipelines (such as offline multi-view feature extraction or additional task-specific heads), 3D-LLaVA adopts a minimalist design with an integrated architecture and takes only point clouds as input. At the core of 3D-LLaVA is a new Omni Superpoint Transformer (OST), which integrates three functionalities: (1) a visual feature selector that converts and selects visual tokens,…

Taxonomy

Topics: Advanced Measurement and Metrology Techniques · Photonic and Optical Devices · Advanced Fiber Optic Sensors

Methods: Byte Pair Encoding · Linear Layer · Softmax · Dense Connections · Attention Is All You Need · Absolute Position Encodings · Dropout · Adam · Residual Connection · Multi-Head Attention