Imagine a City: CityGenAgent for Procedural 3D City Generation
Zishan Liu, Zecong Tang, RuoCheng Wu, Xinzhe Zheng, Jingyu Hu, Ka-Hei Hui, Haoran Xie, Bo Dai, Zhengzhe Liu

TL;DR
CityGenAgent is a novel framework that uses natural language and hierarchical procedural methods to generate high-quality, controllable 3D cities with improved semantic and visual fidelity.
Contribution
It introduces a two-stage learning approach combining supervised fine-tuning and reinforcement learning for structured city generation from natural language.
Findings
Outperforms existing methods in semantic alignment and visual quality.
Supports natural language editing and manipulation of 3D cities.
Demonstrates robust generalization in procedural city generation.
Abstract
The automated generation of interactive 3D cities is a critical challenge with broad applications in autonomous driving, virtual reality, and embodied intelligence. While recent advances in generative models and procedural techniques have improved the realism of city generation, existing methods often struggle with high-fidelity asset creation, controllability, and manipulation. In this work, we introduce CityGenAgent, a natural language-driven framework for hierarchical procedural generation of high-quality 3D cities. Our approach decomposes city generation into two interpretable components, Block Program and Building Program. To ensure structural correctness and semantic alignment, we adopt a two-stage learning strategy: (1) Supervised Fine-Tuning (SFT). We train BlockGen and BuildingGen to generate valid programs that adhere to schema constraints, including non-self-intersecting…
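The schema constraints named in the abstract (required fields and non-self-intersecting polygons) can be illustrated with a small validity check. This is only a sketch under stated assumptions: the field names `block_id`, `polygon`, and `land_use`, and the functions `is_simple_polygon` and `validate_block_program`, are hypothetical and do not come from the paper, whose actual Block Program schema is not shown here. The check detects proper edge crossings only and ignores degenerate cases such as collinear overlaps.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]

# Hypothetical required fields; the paper's real schema is not given in this summary.
REQUIRED_FIELDS = ("block_id", "polygon", "land_use")


def _orient(a: Point, b: Point, c: Point) -> float:
    """Twice the signed area of triangle abc (positive if counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _properly_cross(p1: Point, p2: Point, p3: Point, p4: Point) -> bool:
    """True if segments p1p2 and p3p4 cross at a single interior point."""
    d1 = _orient(p3, p4, p1)
    d2 = _orient(p3, p4, p2)
    d3 = _orient(p1, p2, p3)
    d4 = _orient(p1, p2, p4)
    return d1 * d2 < 0 and d3 * d4 < 0


def is_simple_polygon(vertices: List[Point]) -> bool:
    """Return False if any two non-adjacent edges properly cross."""
    n = len(vertices)
    if n < 3:
        return False
    edges = [(vertices[i], vertices[(i + 1) % n]) for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Adjacent edges share a vertex and are skipped.
            if j == i + 1 or (i == 0 and j == n - 1):
                continue
            if _properly_cross(*edges[i], *edges[j]):
                return False
    return True


def validate_block_program(program: Dict) -> List[str]:
    """Collect schema violations for one (hypothetical) block program."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in program]
    polygon = program.get("polygon")
    if polygon is not None and not is_simple_polygon(polygon):
        errors.append("polygon is self-intersecting or has fewer than 3 vertices")
    return errors


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
bowtie = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
valid_errors = validate_block_program(
    {"block_id": "b0", "polygon": square, "land_use": "residential"}
)
invalid_errors = validate_block_program({"block_id": "b1", "polygon": bowtie})
```

A check of this kind could serve either as a filter on generated programs or as one computable term in a reward signal; which of these the authors use, if any, is not stated in this summary.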
Peer Reviews
Decision · Submitted to ICLR 2026
1. The task of procedural, language-guided 3D city generation is interesting, addresses a crucial challenge in 3D content creation, and is of interest to the community.
2. The paper is generally well-organized and the approach is presented clearly, making the high-level idea easy for the reader to follow and understand.
1. The technical contribution appears somewhat limited. While the paper focuses heavily on the scene description and iterative refinement driven by the LLM, there is a distinct lack of detailed description regarding the actual procedural generation and manipulation of the 3D assets (e.g., buildings, road networks, and underlying geometric operations). For a 3D generation work, the mechanisms for handling 3D assets should be a core component, yet these sections lack sufficient technical detail.
1. The paper introduces a novel hierarchical procedural generation paradigm via two domain-specific languages (Block Program and Building Program), creatively combining LLMs with structured programs as editable proxies.
2. The technical execution is robust, with clear decomposition into BlockGen/BuildingGen modules, SFT+PPO training pipeline, and reward designs grounded in computable metrics, demonstrating better semantic alignment and visual fidelity over mentioned baselines.
3. The paper is
1. The framework in Section 3.1 decomposes cities using Block Program and Building Program as editable DSL intermediates. However, no comparison is provided with scene graph-based 3D scene generation methods, for example in terms of layout fidelity, editing efficiency, or scalability to multi-block cities.
2. In Section 3.2.1, BlockGen (SFT) is described as enabling the LLM to generate valid Block Programs that adhere to the schema, including non-self-intersecting polygons and required fields. How
The paper proposes integrating LLMs to guide hierarchical procedural generation in the 3D domain, effectively addressing the traditional challenge of translating abstract intent into concrete geometric parameters. The results show that the model can produce better 3D city scenes than existing methods.
1. The model efficiency is a critical weakness for a framework targeting real-world applications like large-scale city generation, which demand high efficiency. Given the reliance on commercial LLM APIs, the token consumption for a complete, complex city generation is likely prohibitive. The paper fails to provide a quantitative scaling analysis (e.g., generation time vs. area/asset count) or propose concrete technical solutions (beyond simple statements) to mitigate the LLM-related computationa
Taxonomy
Topics: 3D Shape Modeling and Analysis · Multimodal Machine Learning Applications · Human Motion and Animation
