TL;DR
This paper introduces ControlCity, a multimodal diffusion model that synthesizes urban morphology by integrating images, text, and metadata, significantly improving realism and controllability over traditional geometric methods.
Contribution
The study presents a novel multimodal diffusion framework for urban morphology generation, combining spatial, semantic, and geographical data for more accurate and controllable urban simulations.
Findings
71.01% reduction in visual error (FID)
38.46% improvement in spatial overlap (MIoU)
Enables cross-city style transfer and zero-shot generation
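For readers unfamiliar with the spatial-overlap metric, here is a minimal sketch of how a mean intersection-over-union (MIoU) score could be computed between a generated building-footprint mask and its ground truth. It assumes binary masks (0 = background, 1 = building) and averages IoU over those two classes. The function name `mean_iou` and the mask encoding are illustrative assumptions, not the paper's evaluation code, and the paper's exact protocol may differ.

```python
from typing import Sequence

# Assumption: a mask is a 2D grid of integer class labels
# (0 = background, 1 = building footprint).
Mask = Sequence[Sequence[int]]


def mean_iou(pred: Mask, target: Mask, classes: Sequence[int] = (0, 1)) -> float:
    """Average the per-class intersection-over-union of two label masks."""
    ious = []
    for cls in classes:
        inter = 0
        union = 0
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                in_pred = p == cls
                in_target = t == cls
                inter += in_pred and in_target  # cell is cls in both masks
                union += in_pred or in_target   # cell is cls in either mask
        if union:  # skip classes that appear in neither mask
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

A higher value means the generated footprints overlap the reference footprints more closely, and identical masks score 1.0. The percentage under Findings is an improvement in this kind of score, not the score itself.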
Abstract
Urban morphology is fundamental to determining urban functionality and vitality. Prevailing simulation methods, however, often oversimplify morphological generation as a geometric problem and lack a deep understanding of urban semantics and geographical context. To address this limitation, this study proposes ControlCity, a diffusion model that achieves comprehensive urban morphology generation through multimodal information fusion. We first constructed a quadruple dataset comprising "image-text-metadata-building footprints" from 22 cities worldwide. ControlCity uses this multidimensional information as joint control conditions: an enhanced ControlNet architecture encodes spatial constraints from images, while text and metadata provide semantic guidance and geographical priors, respectively. Together, these conditions direct the generation process. Experimental results demonstrate…
Taxonomy
Methods: Diffusion · ALIGN
