RAISECity: A Multimodal Agent Framework for Reality-Aligned 3D World Generation at City-Scale

Shengyuan Wang; Zhiheng Zheng; Yu Shang; Lixuan He; Yangcheng Yu; Fan Hangyu; Jie Feng; Qingmin Liao; Yong Li

arXiv:2511.18005·cs.CV·November 25, 2025

TL;DR

RAISECity is a novel multimodal agent framework that generates detailed, city-scale 3D worlds with high fidelity and realism, addressing previous limitations in quality and scalability for applications like immersive media and embodied intelligence.

Contribution

The paper introduces RAISECity, a new agentic framework leveraging multimodal tools for scalable, high-quality, reality-aligned 3D city-scale world generation, with iterative refinement and robust representations.

Findings

01. Achieves a win rate above 90% in perceptual quality benchmarks.

02. Outperforms existing methods in shape precision, texture fidelity, and aesthetics.

03. Demonstrates scalability and compatibility with graphics pipelines.
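
To make the win-rate figure concrete: a win rate is commonly the fraction of head-to-head comparisons in which one method is preferred over another. The sketch below assumes that pairwise reading; this page does not describe the paper's evaluation protocol, and the labels `"A"` and `"B"` and the vote list are hypothetical placeholders, not results from the paper.

```python
from typing import Sequence


def win_rate(outcomes: Sequence[str], method: str) -> float:
    """Return the fraction of pairwise comparisons won by `method`.

    `outcomes` holds the label of the preferred method for each comparison.
    """
    if not outcomes:
        raise ValueError("need at least one comparison")
    wins = sum(1 for winner in outcomes if winner == method)
    return wins / len(outcomes)


# Hypothetical votes, made up for illustration only (not data from the paper).
votes = ["A", "A", "B", "A"]
rate_a = win_rate(votes, "A")  # 3 wins out of 4 comparisons
```

Under this reading, a win rate above 90% would mean the method was preferred in more than nine of every ten comparisons.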

Abstract

City-scale 3D generation is of great importance for the development of embodied intelligence and world models. Existing methods, however, face significant challenges regarding quality, fidelity, and scalability in 3D world generation. Thus, we propose RAISECity, a Reality-Aligned Intelligent Synthesis Engine that creates detailed, City-scale 3D worlds. We introduce an agentic framework that leverages diverse multimodal foundation tools to acquire real-world knowledge, maintain robust intermediate representations, and construct complex 3D scenes. This agentic design, featuring dynamic data processing, iterative self-reflection and refinement, and the invocation of advanced multimodal tools, minimizes cumulative errors and enhances overall performance. Extensive quantitative experiments and qualitative analyses validate the superior…
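
The "iterative self-reflection and refinement" mentioned in the abstract can be pictured as a generic critique-and-revise loop. The sketch below is only an illustration of that general pattern, not the authors' implementation: `Draft`, `refine_until_accepted`, the `critique` and `revise` callables, the `threshold`, and the `max_rounds` cap are all hypothetical names and parameters chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Draft:
    """Hypothetical stand-in for an intermediate scene representation."""
    detail_level: int
    score: float = 0.0


def refine_until_accepted(
    draft: Draft,
    critique: Callable[[Draft], float],
    revise: Callable[[Draft], Draft],
    threshold: float = 0.8,
    max_rounds: int = 5,
) -> Tuple[Draft, int]:
    """Score a draft, then revise and re-score until it passes or rounds run out."""
    draft.score = critique(draft)
    rounds = 0
    while draft.score < threshold and rounds < max_rounds:
        draft = revise(draft)          # produce an improved candidate
        draft.score = critique(draft)  # reflect on the new candidate
        rounds += 1
    return draft, rounds


# Toy critic and reviser, assumed for illustration only.
def toy_critique(d: Draft) -> float:
    return min(1.0, d.detail_level / 4)


def toy_revise(d: Draft) -> Draft:
    return Draft(detail_level=d.detail_level + 1)


final, used = refine_until_accepted(Draft(detail_level=0), toy_critique, toy_revise)
```

In a real system the critic would presumably be a multimodal model judging the generated scene and the reviser would call generation tools again. The cap on rounds keeps the loop bounded even when the critic is never satisfied.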

Taxonomy

Topics: 3D Shape Modeling and Analysis · Human Motion and Animation · Interactive and Immersive Displays