Pts3D-LLM: Studying the Impact of Token Structure for 3D Scene Understanding With Large Language Models
Hugues Thomas, Chen Chen, Jian Zhang

TL;DR
This paper systematically compares video-based and point-based token structures for 3D scene understanding in multimodal large language models, and introduces a token enrichment approach based on 3D point cloud features that achieves state-of-the-art results.
Contribution
The paper provides a controlled analysis of 3D token structures, keeping model backbones and parameters consistent, and proposes a method that enriches visual tokens with 3D point cloud features to improve model performance.
Findings
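Merging explicit 3D point cloud features into the visual tokens significantly improves performance.
Point-based tokens can rival video-based tokens when the points are carefully sampled and ordered.
The best models from both token structures achieve state-of-the-art results on multiple 3D understanding benchmarks.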
Abstract
Effectively representing 3D scenes for Multimodal Large Language Models (MLLMs) is crucial yet challenging. Existing approaches commonly rely only on 2D image features and use varied tokenization approaches. This work presents a rigorous study of 3D token structures, systematically comparing video-based and point-based representations while maintaining consistent model backbones and parameters. We propose a novel approach that enriches visual tokens by incorporating 3D point cloud features from a Sonata-pretrained Point Transformer V3 encoder. Our experiments demonstrate that merging explicit 3D features significantly boosts performance. Furthermore, we show that point-based token structures can rival video-based ones when the points are cleverly sampled and ordered. Our best models from both structures achieve state-of-the-art results on multiple 3D understanding benchmarks. We…
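To make the idea of "sampled and ordered" point-based tokens concrete, here is a minimal sketch. It is an assumption for illustration, not the paper's procedure: the abstract does not say which sampling or ordering scheme is used. The sketch picks token locations with farthest point sampling and orders them along a z-order (Morton) curve, two common choices for point clouds. All function names and the `grid_size` parameter are hypothetical.

```python
import math


def farthest_point_sampling(points, k):
    """Greedily pick k point indices, each as far as possible from those already chosen.

    Assumption: the first point seeds the selection; the paper's sampler may differ.
    """
    if not points or k <= 0:
        return []
    k = min(k, len(points))
    chosen = [0]
    # Distance from every point to its nearest chosen point so far.
    dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            dist[i] = min(dist[i], math.dist(p, points[nxt]))
    return chosen


def morton_code(cell, bits=10):
    """Interleave the bits of integer grid coordinates (x, y, z) into one z-order key."""
    code = 0
    for b in range(bits):
        for axis, c in enumerate(cell):
            code |= ((c >> b) & 1) << (3 * b + axis)
    return code


def order_tokens(points, indices, grid_size=0.5):
    """Sort sampled indices along a z-order curve so spatial neighbors stay close in the sequence."""
    mins = [min(p[a] for p in points) for a in range(3)]

    def key(i):
        cell = tuple(int((points[i][a] - mins[a]) / grid_size) for a in range(3))
        return morton_code(cell)

    return sorted(indices, key=key)


# Toy scene: five 3D points, three of which become token locations.
scene = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (4.0, 4.0, 4.0), (0.1, 0.0, 0.0)]
sampled = farthest_point_sampling(scene, 3)
token_order = order_tokens(scene, sampled)
```

In a pipeline like the one the abstract describes, each ordered location would carry a visual feature enriched with a 3D feature before being passed to the language model; that fusion step is not shown here because the abstract does not specify it.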
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications
