VAR-3D: View-aware Auto-Regressive Model for Text-to-3D Generation via a 3D Tokenizer

Zongcheng Han; Dongyan Cao; Haoran Sun; Yu Hong

arXiv:2602.13818 · cs.CV · February 17, 2026

Open Access

TL;DR

VAR-3D introduces a view-aware auto-regressive model with a 3D tokenizer and rendering supervision, significantly improving text-to-3D generation quality and structural fidelity over previous methods.

Contribution

The paper presents a novel view-aware auto-regressive model with a 3D tokenizer and a rendering-supervised training strategy for improved text-to-3D generation.

Findings

01. Outperforms existing methods in generation quality

02. Achieves better text-3D alignment

03. Enhances structural and visual fidelity

Abstract

Recent advances in auto-regressive transformers have achieved remarkable success in generative modeling. However, text-to-3D generation remains challenging, primarily due to bottlenecks in learning discrete 3D representations. Specifically, existing approaches often suffer from information loss during encoding, causing representational distortion before the quantization process. This effect is further amplified by vector quantization, ultimately degrading the geometric coherence of text-conditioned 3D shapes. Moreover, the conventional two-stage training paradigm induces an objective mismatch between reconstruction and text-conditioned auto-regressive generation. To address these issues, we propose View-aware Auto-Regressive 3D (VAR-3D), which integrates a view-aware 3D Vector Quantized-Variational AutoEncoder (VQ-VAE) to convert the complex geometric structure of 3D models into…
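
To make the quantization step mentioned in the abstract concrete, here is a minimal sketch of generic vector quantization: each continuous latent vector is replaced by the index of its nearest codebook entry, and the gap between the original and the looked-up entry is the quantization error. This is not the paper's tokenizer; the function names, the toy codebook, and the latent values are all hypothetical and chosen only for illustration.

```python
# Generic nearest-neighbour vector quantization (illustrative only;
# not the VAR-3D tokenizer). All names and values are assumptions.

def quantize(vectors, codebook):
    """Return, for each vector, the index of the closest codebook entry
    under squared Euclidean distance."""
    indices = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def dequantize(indices, codebook):
    """Look up the codebook entry for each discrete token index."""
    return [codebook[i] for i in indices]


def mean_squared_error(originals, reconstructions):
    """Average squared difference over all vector components."""
    total = sum(
        (a - b) ** 2
        for x, y in zip(originals, reconstructions)
        for a, b in zip(x, y)
    )
    count = sum(len(x) for x in originals)
    return total / count


# Toy 2-D codebook with four entries (hypothetical values).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Toy continuous latents, standing in for encoder outputs.
latents = [(0.1, 0.2), (0.9, 0.1), (0.4, 0.8)]

tokens = quantize(latents, codebook)       # discrete token indices
recon = dequantize(tokens, codebook)       # quantized latents
error = mean_squared_error(latents, recon) # information lost by quantization

print(tokens, error)
```

The error is nonzero whenever a latent does not coincide with a codebook entry, so any distortion already present in the encoder output is carried into, and added to by, the discrete tokens that an auto-regressive model is later trained on.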

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis · Human Motion and Animation