TL;DR
TRELLISWorld introduces a training-free, modular approach to generate large, coherent 3D scenes from text prompts by repurposing object diffusion models, enabling scalable and flexible scene creation without retraining.
Contribution
It reformulates scene generation as a multi-tile denoising problem using object diffusion models, eliminating the need for scene-level training or datasets.
Findings
Supports diverse scene layouts
Enables efficient scene synthesis
Allows flexible editing of 3D scenes
Abstract
Text-driven 3D scene generation holds promise for a wide range of applications, from virtual prototyping to AR/VR and simulation. However, existing methods are often constrained to single-object generation, require domain-specific training, or lack support for full 360-degree viewability. In this work, we present a training-free approach to 3D scene synthesis by repurposing general-purpose text-to-3D object diffusion models as modular tile generators. We reformulate scene generation as a multi-tile denoising problem, where overlapping 3D regions are independently generated and seamlessly blended via weighted averaging. This enables scalable synthesis of large, coherent scenes while preserving local semantic control. Our method eliminates the need for scene-level datasets or retraining, relies on minimal heuristics, and inherits the generalization capabilities of object-level priors. We…
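The blending step described in the abstract can be illustrated with a minimal sketch. It assumes tiles are dense 3D arrays on a shared grid and uses a raised-cosine weight per axis (one reviewer mentions cosine blending). The function names `cosine_window` and `blend_tiles`, the tile layout, and the choice of window are illustrative assumptions, not the authors' implementation. The sketch covers only the weighted averaging of overlapping tiles, not the denoising itself.

```python
import numpy as np

def cosine_window(size: int) -> np.ndarray:
    """1D raised-cosine weights: small near the tile edges, largest near the center."""
    # Sample at half-integer positions so every weight is strictly positive.
    t = (np.arange(size) + 0.5) / size
    return 0.5 * (1.0 - np.cos(2.0 * np.pi * t))

def blend_tiles(tiles, origins, scene_shape):
    """Weighted-average overlapping 3D tiles into one scene grid.

    tiles:       list of arrays of shape (D, H, W)   (hypothetical layout)
    origins:     list of (z, y, x) corner offsets of each tile in the scene
    scene_shape: (D, H, W) of the full scene grid
    """
    acc = np.zeros(scene_shape)   # running sum of weight * value
    wsum = np.zeros(scene_shape)  # running sum of weights
    for tile, (z, y, x) in zip(tiles, origins):
        d, h, w = tile.shape
        # Separable 3D weight: outer product of three 1D windows.
        weight = (cosine_window(d)[:, None, None]
                  * cosine_window(h)[None, :, None]
                  * cosine_window(w)[None, None, :])
        acc[z:z + d, y:y + h, x:x + w] += weight * tile
        wsum[z:z + d, y:y + h, x:x + w] += weight
    # Normalize; the floor guards against cells that no tile covers.
    return acc / np.maximum(wsum, 1e-8)
```

Where only one tile covers a cell, the normalization returns that tile's value unchanged. Where tiles overlap, the result moves smoothly from one tile's value to the other's, which is what suppresses visible seams.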
Peer Reviews
Decision·Submitted to ICLR 2026
The core idea of using tiled diffusion with cosine blending to smooth the inter-tile transitions is straightforward and intuitive. The method description is clear and the implementation section provides some details, though it is doubtful whether they are sufficient for readers to reproduce the work without open-source code. The results show clear advantages over the peer work SynCity. The limitations section acknowledges the method's dependence on its base model and its lack of object disentanglement.
As mentioned in the strengths, the method is quite straightforward, so its impact depends heavily on the tool being released to the community as open-source code, as SynCity has done. The contribution is more an incremental improvement on TRELLIS than a new method, so whether it meets the bar for a standalone ICLR paper may need further discussion. The work depends heavily on the base object generator, which limits the contribution. The comparison is mainly against SynCity while
1. State-of-the-Art Results: The proposed TRELLISWorld achieves superior CLIP score performance compared to the recent state-of-the-art method SynCity, while also requiring less computational resources and delivering faster inference speed. This demonstrates the efficiency and scalability of the training-free design.
2. Comprehensive Ablation Studies: The authors present comprehensive qualitative ablation studies on key components (Tiled Diffusion, Blending, and Tiled Decoder), clearly illustrating
1. Heavy Reliance on the Base Model: As acknowledged in the manuscript, the proposed method, being training-free, is inherently limited by the capabilities of its underlying base model, TRELLIS. Consequently, the overall performance and generalization ability are closely tied to the pretrained model’s strengths and weaknesses, which may restrict the method’s applicability across diverse domains.
2. Lack of Quantitative Ablation Studies: While the qualitative ablation studies provide valuable insigh
Overall, the paper introduces a training-free method to generate 3D worlds using a 3D object generator. It provides a simple and effective method that enables meaningful applications.
The originality of the paper is somewhat limited. The paper claims that it differs from SynCity in that SynCity depends on image inpainting, but this is merely a difference in the conditioning mechanism. Aside from this conditioning mechanism, the overall pipeline, which involves generating tiles and then blending them, is very similar. Furthermore, the tile diffusion and blending mechanisms described in the paper are very similar to those of MultiDiffusion [1], and I haven't seen any ef
