AssetFormer: Modular 3D Assets Generation with Autoregressive Transformer

Lingting Zhu; Shengju Qian; Haidi Fan; Jiayu Dong; Zhenchao Jin; Siwei Zhou; Gen Dong; Xin Wang; Lequan Yu

arXiv:2602.12100·cs.CV·February 13, 2026

TL;DR

AssetFormer is a novel autoregressive Transformer model that generates modular 3D assets from text descriptions, improving quality and flexibility for user-generated content and professional use.

Contribution

It introduces a new Transformer-based framework for modular 3D asset generation from text, with innovative sequencing and decoding techniques inspired by language models.

Findings

01. Effective in generating high-quality modular 3D assets
02. Streamlines asset creation for UGC and professional development
03. Flexible framework extendable to various 3D asset types

Abstract

The digital industry demands high-quality, diverse modular 3D assets, especially for user-generated content (UGC). In this work, we introduce AssetFormer, an autoregressive Transformer-based model designed to generate modular 3D assets from textual descriptions. Our pilot study leverages real-world modular assets collected from online platforms. AssetFormer tackles the challenge of creating assets composed of primitives that adhere to constrained design parameters for various applications. By innovatively adapting module sequencing and decoding techniques inspired by language models, our approach enhances asset generation quality through autoregressive modeling. Initial results indicate the effectiveness of AssetFormer in streamlining asset creation for professional development and UGC scenarios. This work presents a flexible framework extendable to various types of modular 3D assets,…
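To make the idea of autoregressive modeling over modules concrete, the following minimal sketch shows one way a modular asset could be flattened into a token sequence for a language-model-style decoder. Everything in it is an assumption for illustration, not the paper's actual scheme: the `Primitive` fields, the vocabulary layout, the grid size, and the bottom-up sort order are all hypothetical.

```python
from typing import List, NamedTuple


class Primitive(NamedTuple):
    """One module of an asset (hypothetical fields, not from the paper)."""
    type_id: int  # index into an assumed catalogue of parts
    x: int        # grid coordinates, assumed to lie in [0, GRID)
    y: int
    z: int        # assumed to be the vertical axis
    rot: int      # index into an assumed set of discrete rotations


# Assumed sizes, chosen only for this illustration.
NUM_TYPES, GRID, NUM_ROTS = 8, 16, 4

# Hypothetical token layout: [BOS, EOS | types | coordinates | rotations].
BOS, EOS = 0, 1
TYPE_BASE = 2
COORD_BASE = TYPE_BASE + NUM_TYPES
ROT_BASE = COORD_BASE + GRID
VOCAB_SIZE = ROT_BASE + NUM_ROTS


def encode(asset: List[Primitive]) -> List[int]:
    """Flatten an asset into tokens, five per primitive, in a fixed order."""
    tokens = [BOS]
    # Sort bottom-up so each asset maps to exactly one canonical sequence.
    for p in sorted(asset, key=lambda p: (p.z, p.y, p.x)):
        tokens += [
            TYPE_BASE + p.type_id,
            COORD_BASE + p.x,
            COORD_BASE + p.y,
            COORD_BASE + p.z,
            ROT_BASE + p.rot,
        ]
    tokens.append(EOS)
    return tokens


def decode(tokens: List[int]) -> List[Primitive]:
    """Invert encode(): strip BOS/EOS and regroup tokens five at a time."""
    body = tokens[1:-1]
    prims = []
    for i in range(0, len(body), 5):
        t, x, y, z, r = body[i:i + 5]
        prims.append(Primitive(t - TYPE_BASE, x - COORD_BASE,
                               y - COORD_BASE, z - COORD_BASE, r - ROT_BASE))
    return prims
```

Under these assumptions, a Transformer would be trained with next-token prediction over sequences such as `encode(asset)`, and sampled sequences would be turned back into placeable modules with `decode`.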

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

- Encouraging results.
- Theoretically well-written paper.

Weaknesses

- Can have more ablations.
- Can be good for the paper to have more expressive examples.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The use of large language models (LLMs) to generate sequential, modular 3D primitives enhances both efficiency and interpretability.
- If released, the dataset could be a valuable contribution to the community, addressing the current gap in large-scale modular 3D assets.
- The ability to directly integrate the generated 3D models in game engines unlocks numerous real-world applications.
- AssetFormer demonstrates strong generation quality through its proposed techniques, including token orderi

Weaknesses

- The paper is similar in spirit to existing approaches [1,2] that utilize different 3D primitives with autoregressive transformer models for sequential 3D asset generation. While the method is adapted to a homestead-specific dataset, the core technical contribution appears incremental. Additionally, the paper would benefit from including and discussing these relevant prior works.
- Quantitative evaluation is limited, with comparisons restricted to MeshGPT. Broader benchmarking against other LLM

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1: The decision to focus on modular assets is highly practical and directly addresses key pain points in UGC and game development. This representation is efficient, easy to edit, and perfectly suited for an autoregressive framework.
2: The creation and combination of a procedural dataset and a real-world user-generated dataset is a major strength. The ablation study convincingly shows that this hybrid approach is superior to using either source alone, providing both structure and diversity. Th

Weaknesses

1: The model relies on a fixed, discrete vocabulary for primitive types, positions, and rotations. This fundamentally limits its creative potential to the predefined set of components and a grid-based layout. It cannot generate novel primitive shapes or place them with continuous precision, which will be a constraint for more organic or complex designs.
2: The method is demonstrated on buildings composed from a specific set of architectural parts. It is unclear how well the approach would scale
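The trade-off raised in the first weakness can be illustrated with a small sketch of grid quantization. It is not taken from the paper: the cell size, grid extent, rotation step count, and the function names `snap_position`, `grid_to_world`, and `snap_rotation` are hypothetical choices that only show how a discrete vocabulary discards continuous placement detail.

```python
CELL = 0.5       # assumed world-space size of one grid cell
GRID = 16        # assumed number of cells per axis
ROT_STEPS = 4    # assumed number of allowed rotations (90 degree steps)


def snap_position(value: float) -> int:
    """Map a continuous coordinate to the nearest grid index, clamped to the grid."""
    idx = round(value / CELL)
    return max(0, min(GRID - 1, idx))


def grid_to_world(idx: int) -> float:
    """Recover the world-space coordinate represented by a grid index."""
    return idx * CELL


def snap_rotation(degrees: float) -> int:
    """Map a continuous yaw angle to the nearest of ROT_STEPS discrete rotations."""
    step = 360.0 / ROT_STEPS
    return round(degrees / step) % ROT_STEPS


# A placement at x = 1.3 is stored as index 3 and comes back as 1.5,
# so up to half a cell of precision is lost per axis.
error = abs(grid_to_world(snap_position(1.3)) - 1.3)
```

With these assumed settings, any placement or orientation between grid points is rounded away, which is the kind of constraint the reviewer anticipates for organic or irregular designs.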

Code & Models

Models

🤗 ltzhu/AssetFormer (model)

Taxonomy

Topics: 3D Shape Modeling and Analysis · Interactive and Immersive Displays · Human Motion and Animation