ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding

Junliang Ye; Zhengyi Wang; Ruowen Zhao; Shenghao Xie; Jun Zhu

arXiv:2506.01853·cs.CV·June 3, 2025

Open Access · 1 Repo · 2 Models · 1 Dataset

TL;DR

ShapeLLM-Omni is a native 3D multimodal large language model that can both understand and generate 3D assets and text, extending native multimodal LLMs beyond images and text.

Contribution

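It introduces a 3D vector-quantized variational autoencoder (VQVAE) that maps 3D objects to discrete tokens, and 3D-Alpaca, a large-scale dataset covering generation, comprehension, and editing, enabling instruction-based training for 3D content understanding and generation.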

Findings

01 Effective 3D shape representation and reconstruction.

02 Creation of the 3D-Alpaca dataset for training.

03 Demonstrated 3D understanding and generation capabilities.

Abstract

Recently, the powerful text-to-image capabilities of ChatGPT-4o have led to growing appreciation for native multimodal large language models. However, its multimodal capabilities remain confined to images and text. Beyond images, the ability to understand and generate 3D content is equally crucial. To address this gap, we propose ShapeLLM-Omni, a native 3D large language model capable of understanding and generating 3D assets and text in any sequence. First, we train a 3D vector-quantized variational autoencoder (VQVAE), which maps 3D objects into a discrete latent space to achieve efficient and accurate shape representation and reconstruction. Building upon the 3D-aware discrete tokens, we construct a large-scale continuous training dataset named 3D-Alpaca, encompassing generation, comprehension, and editing, thus providing rich resources for future research and…
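
To make the "discrete latent space" idea concrete, here is a minimal sketch of the vector-quantization step that a VQVAE generally relies on: each continuous latent vector is replaced by the index of its nearest codebook entry, and those indices serve as tokens. This is not the authors' implementation. The function name `quantize`, the codebook size (16), and the latent dimension (4) are hypothetical values chosen only for illustration, and the random arrays stand in for encoder outputs.

```python
import numpy as np


def quantize(latents, codebook):
    """Nearest-neighbour vector quantization (illustrative sketch only).

    latents:  (N, D) array of continuous latent vectors.
    codebook: (K, D) array of code vectors.
    Returns the (N,) integer token indices and the (N, D) quantized vectors.
    """
    # Squared Euclidean distance from every latent to every code: shape (N, K).
    diffs = latents[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    # Each latent becomes the index of its closest code: a discrete token.
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]


# Hypothetical sizes; random data stands in for a real encoder's output.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))
latents = rng.normal(size=(8, 4))

tokens, quantized = quantize(latents, codebook)
print(tokens.shape, quantized.shape)
```

Once a shape is expressed as a sequence of integer indices like `tokens`, a language model can in principle treat it the same way it treats text tokens, which is the property the abstract builds on.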

Code & Models

Repositories

jamesyjl/shapellm-omni
PyTorch · Official

Models

Datasets

yejunliang23/3D-Alpaca
dataset · 94 downloads

Taxonomy

Topics: Image Processing and 3D Reconstruction · 3D Surveying and Cultural Heritage · Human Motion and Animation