ProcGen3D: Learning Neural Procedural Graph Representations for Image-to-3D Reconstruction
Xinyi Zhang, Daoyi Gao, Naiqi Li, Angela Dai

TL;DR
ProcGen3D reconstructs 3D assets from images by generating procedural graphs with a transformer and Monte Carlo Tree Search (MCTS) guided sampling, producing detailed, domain-specific 3D assets and outperforming existing techniques.
Contribution
The paper presents a novel graph-based procedural representation and a transformer-based generative model with MCTS-guided sampling for improved image-to-3D reconstruction.
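As a rough illustration of what MCTS-guided sampling over a token sequence can look like, the sketch below searches for the next token by combining a prior policy with a reward on completed sequences. It is a generic PUCT-style search on a toy problem, not the paper's implementation: `uniform_policy`, `match_reward`, `TARGET`, and all parameter values are hypothetical stand-ins for the image-conditioned transformer prior and an image-alignment score.

```python
import math
import random

class Node:
    def __init__(self, prior):
        self.prior = prior      # prior probability assigned by the policy
        self.visits = 0
        self.value_sum = 0.0
        self.children = {}      # token -> Node

    def value(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node, c_puct):
    """Pick the child with the best PUCT score (mean value plus prior-weighted exploration)."""
    def score(item):
        _, child = item
        explore = c_puct * child.prior * math.sqrt(node.visits + 1) / (1 + child.visits)
        return child.value() + explore
    return max(node.children.items(), key=score)

def rollout(prefix, policy, length, rng):
    """Complete a prefix by sampling the remaining tokens from the prior."""
    seq = list(prefix)
    while len(seq) < length:
        probs = policy(seq)
        seq.append(rng.choices(range(len(probs)), weights=probs)[0])
    return seq

def mcts_next_token(prefix, policy, reward, length, sims, c_puct, rng):
    root = Node(1.0)
    for _ in range(sims):
        node, seq, path = root, list(prefix), [root]
        # Selection: descend while the node is expanded and the sequence is incomplete.
        while node.children and len(seq) < length:
            token, node = select_child(node, c_puct)
            seq.append(token)
            path.append(node)
        # Expansion: create one child per candidate token, weighted by the prior.
        if len(seq) < length:
            for token, p in enumerate(policy(seq)):
                node.children[token] = Node(p)
        # Simulation and backpropagation along the visited path.
        value = reward(rollout(seq, policy, length, rng))
        for visited in path:
            visited.visits += 1
            visited.value_sum += value
    # Commit to the most-visited token at the root.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

def mcts_guided_sample(policy, reward, length, sims=200, c_puct=1.0, seed=0):
    rng = random.Random(seed)
    seq = []
    while len(seq) < length:
        seq.append(mcts_next_token(seq, policy, reward, length, sims, c_puct, rng))
    return seq

# Toy stand-ins (hypothetical): a uniform prior over 3 tokens and a reward
# that counts positions matching a fixed target sequence.
TARGET = [2, 0, 1, 2]

def uniform_policy(seq):
    return [1 / 3, 1 / 3, 1 / 3]

def match_reward(seq):
    return sum(a == b for a, b in zip(seq, TARGET)) / len(TARGET)
```

Calling `mcts_guided_sample(uniform_policy, match_reward, length=4)` returns a four-token sequence. In an image-to-3D setting the reward would instead score how well the decoded graph agrees with the input image, and each search step adds model evaluations on top of plain sampling.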
Findings
Outperforms state-of-the-art 3D generative methods
Enables better generalization to real-world images
Effective across diverse object categories like cacti, trees, and bridges
Abstract
We introduce ProcGen3D, a new approach for 3D content creation by generating procedural graph abstractions of 3D objects, which can then be decoded into rich, complex 3D assets. Inspired by the prevalent use of procedural generators in production 3D applications, we propose a sequentialized, graph-based procedural graph representation for 3D assets. We use this to learn to approximate the landscape of a procedural generator for image-based 3D reconstruction. We employ edge-based tokenization to encode the procedural graphs, and train a transformer prior to predict the next token conditioned on an input RGB image. Crucially, to enable better alignment of our generated outputs to an input image, we incorporate Monte Carlo Tree Search (MCTS) guided sampling into our generation process, steering output procedural graphs towards more image-faithful reconstructions. Our approach is applicable…
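To make the idea of edge-based tokenization concrete, here is a minimal sketch of one way a graph could be flattened into a token sequence edge by edge. The details are assumptions for illustration only: the breadth-first edge order, the six-tokens-per-edge layout, the 64-bin coordinate quantization, and the function names `quantize`, `tokenize_edges`, and `detokenize_edges` are not taken from the paper.

```python
from collections import deque

def quantize(value, bins=64, lo=-1.0, hi=1.0):
    """Map a float in [lo, hi] to an integer bin index in [0, bins - 1]."""
    clipped = min(max(value, lo), hi)
    return min(int((clipped - lo) / (hi - lo) * bins), bins - 1)

def tokenize_edges(positions, edges, root=0, bins=64):
    """Serialize a graph as a flat token list, one edge at a time.

    positions: dict mapping node id -> (x, y, z) floats in [-1, 1]
    edges: list of undirected (u, v) pairs
    Edges are visited breadth-first from `root`. Each edge contributes six
    tokens: the quantized coordinates of its source, then of its target.
    """
    adjacency = {n: [] for n in positions}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    tokens, seen_edges, visited = [], set(), {root}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in sorted(adjacency[u]):
            key = (min(u, v), max(u, v))
            if key in seen_edges:
                continue  # each undirected edge is emitted once
            seen_edges.add(key)
            tokens.extend(quantize(c, bins) for c in positions[u])
            tokens.extend(quantize(c, bins) for c in positions[v])
            if v not in visited:
                visited.add(v)
                queue.append(v)
    return tokens

def detokenize_edges(tokens):
    """Recover edges as pairs of quantized coordinate triples."""
    assert len(tokens) % 6 == 0, "each edge uses exactly six tokens"
    return [
        (tuple(tokens[i:i + 3]), tuple(tokens[i + 3:i + 6]))
        for i in range(0, len(tokens), 6)
    ]
```

A sequence in this form can be fed to a next-token model in the usual way. Because nodes are identified by their quantized coordinates, shared endpoints reappear as identical triples and the graph connectivity can be recovered from the tokens alone.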
Peer Reviews
Decision: Submitted to ICLR 2026
1. Using a transformer model to learn procedural graph generation is interesting and allows a more compact representation.
2. MCTS-guided sampling is novel and experimentally performs better for complex local geometry.
3. The approach seems to generalize to real images, based on qualitative evaluations.
1. The approach is limited to a single-view image, which is not sufficient for capturing the full geometry of the object.
2. The overall generation is constrained by the limitations of autoregressive models: the overall structure depends on the order of generation, and errors can easily propagate. The paper does not discuss the robustness of the generation process to small early mistakes.
3. The overall dataset is very limited, with just three categories. The set of real-imag
- As far as I know, this is the first work on image-conditioned 3D generation using procedural graphs as the 3D representation. Procedural graphs have many advantages, such as generality and the ability to represent details well and succinctly.
- The construction of the Transformer-based generative model is sound.
- Experiments show good qualitative results.
The weaknesses of the paper fall into two main points. First, I think that the experiments are incomplete.
- The model is trained separately on each category of objects. This calls the generality of the experimental results into question. Was training done on a more diverse dataset?
- Wonder3D and TRELLIS were both trained on a diverse set of objects. Therefore, comparing to them in a category-specific manner is not quite fair to them. It would strengthen the paper to include another category-specific
Innovative Representation: The idea of learning procedural graphs as the latent 3D representation is novel and conceptually elegant. It bridges neural generative modeling and procedural graphics in a meaningful way.
Compact & Interpretable Outputs: Procedural graphs are lightweight and structured, providing interpretable intermediate representations rather than opaque neural fields or dense meshes.
Effective Image Alignment: The use of MCTS-guided sampling for test-time refinement is an origin
Missing Comparison: The paper doesn't compare to DI-PCG, which I believe is a very important, relevant baseline.
Limited Real-World Examples: Although the method claims to generalize to real-world examples, the number of such results is limited.
Limited Scope of Objects: The evaluated categories (trees, cacti, bridges) are all graph-structured and hierarchical. It's unclear how well the method extends to more complex or amorphous shapes (e.g., vehicles, furniture).
Computational Overhead of
Taxonomy
Topics: 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques
