Aligning Text, Images, and 3D Structure Token-by-Token
Aadarsh Sahoo, Vansh Tibrewal, Georgia Gkioxari

TL;DR
This paper introduces a unified autoregressive model that aligns language, images, and 3D scenes, enabling improved understanding and reconstruction of complex 3D environments from various data modalities.
Contribution
It presents a novel framework for tokenizing and modeling 3D scenes within a language model architecture, addressing key challenges in multimodal 3D understanding.
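As a rough illustration of what tokenizing a structured 3D scene can mean, the sketch below serializes a list of objects into a flat token sequence that an autoregressive model could consume. It is not the paper's scheme: the `SceneObject` fields, the coordinate range, the bin count, and the token spellings are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    # Hypothetical attributes; the paper's actual object schema may differ.
    shape: str
    x: float
    y: float
    z: float


def quantize(value, lo=-3.0, hi=3.0, bins=64):
    """Map a continuous coordinate to a discrete bin index in [0, bins - 1].

    The range [-3, 3] and the 64 bins are assumptions for illustration.
    """
    clipped = min(max(value, lo), hi)
    idx = int((clipped - lo) / (hi - lo) * bins)
    return min(idx, bins - 1)  # the upper edge falls into the last bin


def tokenize_scene(objects):
    """Serialize a scene into a flat list of string tokens."""
    tokens = ["<scene>"]
    for obj in objects:
        tokens.append("<obj>")
        tokens.append(f"<shape:{obj.shape}>")
        for axis, value in (("x", obj.x), ("y", obj.y), ("z", obj.z)):
            tokens.append(f"<{axis}:{quantize(value)}>")
        tokens.append("</obj>")
    tokens.append("</scene>")
    return tokens


scene = [SceneObject("cube", 0.0, -3.0, 3.0)]
print(tokenize_scene(scene))
```

Discretizing coordinates this way lets 3D structure share one vocabulary-style interface with text and image tokens, at the cost of a quantization error bounded by the bin width.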
Findings
Effective 3D scene reconstruction from a single image
Improved 3D object recognition on real-world datasets
Versatile performance across multiple 3D tasks
Abstract
Creating machines capable of understanding the world in 3D is essential for assisting designers who build and edit 3D environments and robots that navigate and interact within three-dimensional space. Inspired by advances in language and image modeling, we investigate the potential of autoregressive models for a new modality: structured 3D scenes. To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes, and provide a detailed "cookbook" outlining critical design choices for achieving optimal training and performance, addressing key questions related to data representation, modality-specific objectives, and more. We show how to tokenize complex 3D objects to incorporate into our structured 3D scene modality. We evaluate performance across four core 3D tasks -- rendering, recognition, instruction-following, and question-answering -- and four 3D…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Handwritten Text Recognition Techniques
