SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE
Yongwei Chen, Yushi Lan, Shangchen Zhou, Tengfei Wang, Xingang Pan

TL;DR
SAR3D introduces a multi-scale autoregressive framework for 3D object generation and understanding, significantly improving speed and quality while enabling LLMs to interpret 3D models effectively.
Contribution
The paper presents SAR3D, a novel multi-scale 3D VQVAE-based autoregressive model that accelerates 3D generation and enhances understanding with hierarchical 3D-aware tokens.
Findings
Achieves 0.82 seconds 3D generation on A6000 GPU
Surpasses existing methods in speed and quality
Enables LLMs to interpret and caption 3D models
Abstract
Autoregressive models have demonstrated remarkable success across various fields, from large language models (LLMs) to large multimodal models (LMMs) and 2D content generation, moving closer to artificial general intelligence (AGI). Despite these advances, applying autoregressive approaches to 3D object generation and understanding remains largely unexplored. This paper introduces Scale AutoRegressive 3D (SAR3D), a novel framework that leverages a multi-scale 3D vector-quantized variational autoencoder (VQVAE) to tokenize 3D objects for efficient autoregressive generation and detailed understanding. By predicting the next scale in a multi-scale latent representation instead of the next single token, SAR3D reduces generation time significantly, achieving fast 3D object generation in just 0.82 seconds on an A6000 GPU. Additionally, given the tokens enriched with hierarchical 3D-aware…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Advanced Neural Network Applications
MethodsSPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
