AssetFormer: Modular 3D Assets Generation with Autoregressive Transformer
Lingting Zhu, Shengju Qian, Haidi Fan, Jiayu Dong, Zhenchao Jin, Siwei Zhou, Gen Dong, Xin Wang, Lequan Yu

TL;DR
AssetFormer is a novel autoregressive Transformer model that generates modular 3D assets from text descriptions, improving quality and flexibility for user-generated content and professional use.
Contribution
It introduces a new Transformer-based framework for modular 3D asset generation from text, with innovative sequencing and decoding techniques inspired by language models.
Findings
Effective in generating high-quality modular 3D assets
Streamlines asset creation for UGC and professional development
Flexible framework extendable to various 3D asset types
Abstract
The digital industry demands high-quality, diverse modular 3D assets, especially for user-generated content (UGC). In this work, we introduce AssetFormer, an autoregressive Transformer-based model designed to generate modular 3D assets from textual descriptions. Our pilot study leverages real-world modular assets collected from online platforms. AssetFormer tackles the challenge of creating assets composed of primitives that adhere to constrained design parameters for various applications. By innovatively adapting module sequencing and decoding techniques inspired by language models, our approach enhances asset generation quality through autoregressive modeling. Initial results indicate the effectiveness of AssetFormer in streamlining asset creation for professional development and UGC scenarios. This work presents a flexible framework extendable to various types of modular 3D assets,…
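To make the idea of "module sequencing" concrete, here is a minimal sketch of how a modular asset could be flattened into a token sequence that an autoregressive model predicts one token at a time. Everything in it is an assumption made for illustration: the `Module` fields, the vocabulary sizes, the five-tokens-per-module layout, and the bottom-up sort order are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sizes, chosen only for illustration.
NUM_TYPES = 8   # number of primitive types
GRID = 16       # grid cells per axis
NUM_ROT = 4     # rotation steps (e.g. multiples of 90 degrees)

# Disjoint token-id ranges for each kind of attribute.
TYPE_OFFSET = 0
POS_OFFSET = TYPE_OFFSET + NUM_TYPES
ROT_OFFSET = POS_OFFSET + GRID
EOS = ROT_OFFSET + NUM_ROT          # end-of-sequence token
VOCAB_SIZE = EOS + 1
TOKENS_PER_MODULE = 5               # type, x, y, z, rotation


@dataclass(frozen=True)
class Module:
    type_id: int
    x: int
    y: int
    z: int
    rot: int


def encode(modules: List[Module]) -> List[int]:
    """Flatten modules into one token sequence in a canonical order."""
    # Assumed ordering: bottom-up by height (y), then z, then x.
    ordered = sorted(modules, key=lambda m: (m.y, m.z, m.x))
    tokens: List[int] = []
    for m in ordered:
        tokens += [
            TYPE_OFFSET + m.type_id,
            POS_OFFSET + m.x,
            POS_OFFSET + m.y,
            POS_OFFSET + m.z,
            ROT_OFFSET + m.rot,
        ]
    tokens.append(EOS)
    return tokens


def decode(tokens: List[int]) -> List[Module]:
    """Invert encode(); ignores anything after the first EOS."""
    body = tokens[: tokens.index(EOS)] if EOS in tokens else tokens
    usable = len(body) - len(body) % TOKENS_PER_MODULE  # drop a partial module
    modules = []
    for i in range(0, usable, TOKENS_PER_MODULE):
        t, x, y, z, r = body[i : i + TOKENS_PER_MODULE]
        modules.append(
            Module(t - TYPE_OFFSET, x - POS_OFFSET, y - POS_OFFSET,
                   z - POS_OFFSET, r - ROT_OFFSET)
        )
    return modules


asset = [Module(1, 2, 0, 3, 1), Module(0, 0, 0, 0, 0)]
sequence = encode(asset)
```

With a fixed order like this, the same asset always maps to the same sequence, which is one plausible reason ordering matters for autoregressive training; the ordering the paper actually uses may differ.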
Peer Reviews
Decision: ICLR 2026 Poster
- Encouraging results.
- Theoretically well-written paper.
- The paper could include more ablations.
- More expressive examples would strengthen the paper.
- The use of large language models (LLMs) to generate sequential, modular 3D primitives enhances both efficiency and interpretability.
- If released, the dataset could be a valuable contribution to the community, addressing the current gap in large-scale modular 3D assets.
- The ability to directly integrate the generated 3D models in game engines unlocks numerous real-world applications.
- AssetFormer demonstrates strong generation quality through its proposed techniques, including token orderi
- The paper is similar in spirit to existing approaches [1,2] that utilize different 3D primitives with autoregressive transformer models for sequential 3D asset generation. While the method is adapted to a homestead-specific dataset, the core technical contribution appears incremental. Additionally, the paper would benefit from including and discussing these relevant prior works.
- Quantitative evaluation is limited, with comparisons restricted to MeshGPT. Broader benchmarking against other LLM
1: The decision to focus on modular assets is highly practical and directly addresses key pain points in UGC and game development. This representation is efficient, easy to edit, and perfectly suited for an autoregressive framework.
2: The creation and combination of a procedural dataset and a real-world user-generated dataset is a major strength. The ablation study convincingly shows that this hybrid approach is superior to using either source alone, providing both structure and diversity. Th
1: The model relies on a fixed, discrete vocabulary for primitive types, positions, and rotations. This fundamentally limits its creative potential to the predefined set of components and a grid-based layout. It cannot generate novel primitive shapes or place them with continuous precision, which will be a constraint for more organic or complex designs.
2: The method is demonstrated on buildings composed from a specific set of architectural parts. It is unclear how well the approach would scale
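The discretization concern raised in the review above can be illustrated with a small sketch of snapping continuous placements onto a grid and a fixed set of rotation bins. The cell size, grid extent, and number of rotation bins are hypothetical values picked for the example; the review excerpt does not state the actual resolution used by the model.

```python
import math

# Hypothetical resolution, for illustration only.
CELL_SIZE = 1.0    # world units per grid cell
GRID_CELLS = 16    # cells per axis
ROT_BINS = 4       # allowed rotations: 0, 90, 180, 270 degrees


def quantize_position(value: float) -> int:
    """Snap a continuous coordinate to the nearest grid index, clamped to the grid."""
    idx = math.floor(value / CELL_SIZE + 0.5)  # round half up, avoids banker's rounding
    return max(0, min(GRID_CELLS - 1, idx))


def quantize_rotation(degrees: float) -> int:
    """Snap an angle in degrees to the nearest rotation bin index."""
    step = 360.0 / ROT_BINS
    return math.floor((degrees % 360.0) / step + 0.5) % ROT_BINS


def rotation_error(degrees: float) -> float:
    """Absolute angular difference between the input and its snapped value."""
    step = 360.0 / ROT_BINS
    snapped = quantize_rotation(degrees) * step
    return abs((degrees - snapped + 180.0) % 360.0 - 180.0)


# A wall intended at 100 degrees can only be represented as 90 degrees.
lost = rotation_error(100.0)
```

Any placement that falls between grid cells or rotation bins is rounded away, which is the loss of "continuous precision" the reviewer points to; a finer grid reduces the error at the cost of a larger vocabulary.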
Taxonomy
Topics: 3D Shape Modeling and Analysis · Interactive and Immersive Displays · Human Motion and Animation
