Aligning Text, Images, and 3D Structure Token-by-Token

Aadarsh Sahoo; Vansh Tibrewal; Georgia Gkioxari

arXiv:2506.08002·cs.CV·January 7, 2026

TL;DR

This paper introduces a unified autoregressive model that aligns language, images, and 3D scenes, enabling improved understanding and reconstruction of complex 3D environments from various data modalities.

Contribution

It presents a novel framework for tokenizing and modeling 3D scenes within a language model architecture, addressing key challenges in multimodal 3D understanding.

Findings

01. Effective 3D scene reconstruction from a single image

02. Improved 3D object recognition on real-world datasets

03. Versatile performance across multiple 3D tasks

Abstract

Creating machines capable of understanding the world in 3D is essential for assisting designers who build and edit 3D environments and robots that navigate and interact within a three-dimensional space. Inspired by advances in language and image modeling, we investigate the potential of autoregressive models for a new modality: structured 3D scenes. To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes, and we provide a detailed "cookbook" outlining critical design choices for achieving optimal training and performance, addressing key questions related to data representation, modality-specific objectives, and more. We show how to tokenize complex 3D objects to incorporate into our structured 3D scene modality. We evaluate performance across four core 3D tasks (rendering, recognition, instruction-following, and question-answering) and four 3D…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Handwritten Text Recognition Techniques