SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE

Yongwei Chen; Yushi Lan; Shangchen Zhou; Tengfei Wang; Xingang Pan

arXiv:2411.16856·cs.CV·March 25, 2025

TL;DR

SAR3D introduces a multi-scale autoregressive framework for 3D object generation and understanding, significantly improving speed and quality while enabling LLMs to interpret 3D models effectively.

Contribution

The paper presents SAR3D, a novel multi-scale 3D VQVAE-based autoregressive model that accelerates 3D generation and enhances understanding with hierarchical 3D-aware tokens.

Findings

01. Generates a 3D object in 0.82 seconds on an A6000 GPU

02. Surpasses existing methods in speed and quality

03. Enables LLMs to interpret and caption 3D models

Abstract

Autoregressive models have demonstrated remarkable success across various fields, from large language models (LLMs) to large multimodal models (LMMs) and 2D content generation, moving closer to artificial general intelligence (AGI). Despite these advances, applying autoregressive approaches to 3D object generation and understanding remains largely unexplored. This paper introduces Scale AutoRegressive 3D (SAR3D), a novel framework that leverages a multi-scale 3D vector-quantized variational autoencoder (VQVAE) to tokenize 3D objects for efficient autoregressive generation and detailed understanding. By predicting the next scale in a multi-scale latent representation instead of the next single token, SAR3D reduces generation time significantly, achieving fast 3D object generation in just 0.82 seconds on an A6000 GPU. Additionally, given the tokens enriched with hierarchical 3D-aware…
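To make the speed argument in the abstract concrete, the following is a minimal sketch of why predicting a whole scale per step needs far fewer sequential passes than predicting one token per step. It is not the authors' implementation: the scale schedule `SCALES`, the assumption that each scale is a square token map, the codebook size, and the random stand-in for the model are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical scale schedule: side length of the token map at each scale.
# Illustrative only; these values are not taken from the paper.
SCALES = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]


def tokens_per_scale(scales):
    """Number of tokens at each scale, assuming square token maps."""
    return [s * s for s in scales]


def next_token_steps(scales):
    """Sequential steps if every token is predicted one at a time."""
    return sum(tokens_per_scale(scales))


def next_scale_steps(scales):
    """Sequential steps if all tokens of a scale are predicted together."""
    return len(scales)


def generate_next_scale(scales, vocab_size=16, seed=0):
    """Coarse-to-fine loop with a random stand-in for the model.

    A real model would condition each scale on all coarser scales;
    here the 'prediction' is random so the sketch stays self-contained.
    """
    rng = random.Random(seed)
    token_maps = []
    steps = 0
    for s in scales:
        # One step emits the full s*s token map for this scale.
        token_maps.append([rng.randrange(vocab_size) for _ in range(s * s)])
        steps += 1
    return token_maps, steps


if __name__ == "__main__":
    print("next-token sequential steps:", next_token_steps(SCALES))
    print("next-scale sequential steps:", next_scale_steps(SCALES))
```

Under this assumed schedule, the number of sequential steps equals the number of scales rather than the total number of tokens, which is the source of the speedup the abstract describes.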

Code & Models

Datasets

Yong-Hoon/sar3d-dataset
dataset · 2 dl

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Advanced Neural Network Applications

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings