3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer
Jiajun Deng, Tianyu He, Li Jiang, Tianyu Wang, Feras Dayoub, and Ian Reid

TL;DR
3D-LLaVA introduces a minimalist, powerful 3D multimodal model with an innovative Omni Superpoint Transformer that enhances scene understanding and human interaction using only point clouds.
Contribution
It presents the Omni Superpoint Transformer, a novel architecture that simplifies 3D LMMs and improves their ability to understand and interact with 3D environments.
Findings
Achieves state-of-the-art results on multiple benchmarks.
Demonstrates effective 3D scene understanding and reasoning.
Simplifies 3D LMM architecture with integrated design.
Abstract
Current 3D Large Multimodal Models (3D LMMs) have shown tremendous potential in 3D-vision-based dialogue and reasoning. However, how to further enhance 3D LMMs to achieve fine-grained scene understanding and facilitate flexible human-agent interaction remains a challenging problem. In this work, we introduce 3D-LLaVA, a simple yet highly powerful 3D LMM designed to act as an intelligent assistant in comprehending, reasoning, and interacting with the 3D world. Unlike existing top-performing methods that rely on complicated pipelines, such as offline multi-view feature extraction or additional task-specific heads, 3D-LLaVA adopts a minimalist design with an integrated architecture and takes only point clouds as input. At the core of 3D-LLaVA is a new Omni Superpoint Transformer (OST), which integrates three functionalities: (1) a visual feature selector that converts and selects visual tokens,…
Taxonomy
Topics: Advanced Measurement and Metrology Techniques · Photonic and Optical Devices · Advanced Fiber Optic Sensors
Methods: Byte Pair Encoding · Linear Layer · Softmax · Dense Connections · Attention Is All You Need · Absolute Position Encodings · Dropout · Adam · Residual Connection · Multi-Head Attention
