SceneNAT: Masked Generative Modeling for Language-Guided Indoor Scene Synthesis
Jeongjun Choi, Yeonsoo Park, H. Jin Kim

TL;DR
SceneNAT introduces a masked non-autoregressive Transformer that efficiently generates complete 3D indoor scenes from natural language, outperforming previous methods in accuracy and computational efficiency.
Contribution
The paper proposes SceneNAT, a novel masked modeling approach with a triplet predictor for improved scene synthesis from language instructions.
Findings
Outperforms state-of-the-art methods in semantic and spatial accuracy
Operates with significantly lower computational cost
Effectively captures intra-object and inter-object relationships
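The efficiency finding follows from filling many tokens per forward pass instead of one token at a time. A minimal sketch of few-pass parallel decoding is given here, assuming a MaskGIT-style confidence ranking with a cosine schedule; this summary does not confirm that SceneNAT uses this exact schedule. The names `MASK`, `toy_predictor`, and `parallel_decode` are hypothetical, and the predictor is a random stand-in for the Transformer.

```python
import numpy as np

MASK = -1  # hypothetical id marking a token that is still masked


def toy_predictor(tokens, vocab_size, rng):
    """Random stand-in for the Transformer: per-position class probabilities."""
    logits = rng.normal(size=(tokens.shape[0], vocab_size))
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


def parallel_decode(num_tokens, vocab_size, num_passes=4, seed=0):
    """Fill a fully masked sequence in at most `num_passes` parallel passes."""
    rng = np.random.default_rng(seed)
    tokens = np.full(num_tokens, MASK, dtype=np.int64)
    for step in range(1, num_passes + 1):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        probs = toy_predictor(tokens, vocab_size, rng)
        pred = probs.argmax(axis=1)   # most likely token at every position
        conf = probs.max(axis=1)      # confidence of that prediction
        # Assumed cosine schedule: how many tokens stay masked after this pass.
        keep_masked = int(np.floor(num_tokens * np.cos(np.pi / 2 * step / num_passes)))
        num_to_fill = max(1, masked.size - keep_masked)
        # Commit the most confident predictions among the masked positions.
        order = masked[np.argsort(-conf[masked])]
        chosen = order[:num_to_fill]
        tokens[chosen] = pred[chosen]
    return tokens


decoded = parallel_decode(num_tokens=20, vocab_size=8, num_passes=4)
```

With 20 tokens and 4 passes, the loop runs 4 forward passes rather than 20 sequential steps, which is the source of the cost difference relative to token-by-token autoregressive decoding.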
Abstract
We present SceneNAT, a single-stage masked non-autoregressive Transformer that synthesizes complete 3D indoor scenes from natural language instructions through only a few parallel decoding passes, offering improved performance and efficiency compared to prior state-of-the-art approaches. SceneNAT is trained via masked modeling over fully discretized representations of both semantic and spatial attributes. By applying a masking strategy at both the attribute level and the instance level, the model can better capture intra-object and inter-object structure. To boost relational reasoning, SceneNAT employs a dedicated triplet predictor for modeling the scene's layout and object relationships by mapping a set of learnable relation queries to a sparse set of symbolic triplets (subject, predicate, object). Extensive experiments on the 3D-FRONT dataset demonstrate that SceneNAT achieves…
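The two masking granularities mentioned in the abstract can be illustrated with a small sketch. It assumes a scene is stored as an integer grid with one row per object and one column per discretized attribute; the layout, the reserved `MASK_ID`, the masking probabilities, and the function name `dual_level_mask` are all illustrative assumptions and are not taken from the paper.

```python
import numpy as np

MASK_ID = 0  # hypothetical reserved id; real attribute tokens are assumed to start at 1


def dual_level_mask(scene_tokens, p_instance=0.3, p_attribute=0.3, seed=0):
    """Mask a (num_objects, num_attributes) token grid at two granularities.

    Instance level: every attribute of a selected object is masked.
    Attribute level: individual cells are masked independently.
    Returns the masked grid and the boolean mask of hidden positions.
    """
    rng = np.random.default_rng(seed)
    num_objects, num_attrs = scene_tokens.shape
    instance_mask = rng.random(num_objects) < p_instance
    attribute_mask = rng.random((num_objects, num_attrs)) < p_attribute
    # A cell is hidden if its object was selected or the cell itself was selected.
    mask = attribute_mask | instance_mask[:, None]
    masked_tokens = np.where(mask, MASK_ID, scene_tokens)
    return masked_tokens, mask


# Toy scene: 4 objects, 5 discretized attributes each (for example class, position, size).
scene = np.arange(1, 21).reshape(4, 5)
masked_scene, mask = dual_level_mask(scene)
```

In a training loop, the model would be asked to predict `scene[mask]` from `masked_scene`. Hiding whole rows forces predictions from other objects (inter-object structure), while hiding single cells forces predictions from the remaining attributes of the same object (intra-object structure).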
Peer Reviews
Decision · Submitted to ICLR 2026
+ The paper's core insight, that a Non-Autoregressive Transformer is a powerful alternative to AR or Diffusion models for this task, is a significant strength. This approach opens a promising third direction for high-speed, high-quality structured 3D generation.
+ The concept of the Triplet Predictor is also a key strength. Decoupling symbolic relation-understanding from the geometric generation task is an intelligent design choice.
+ The paper shows impressive performance and efficiency via compre
- A key concern is the reliance on templated data. The training instructions are synthetically generated from structured relations. This implies the model is primarily learning to "invert" this synthetic generation process rather than parsing free-form human language directly. It remains unclear how the model would generalize to ambiguous, "in-the-wild" human instructions that do not follow the rigid structure. Therefore, the claim of handling "complex instructions" may be somewhat overstated.
1. The paper introduces a masked non-autoregressive Transformer for language-guided 3D scene synthesis, augmented with an explicit triplet-based relational module. This design is well-motivated and yields favorable decoding efficiency relative to conventional autoregressive or diffusion approaches.
2. On synthetic benchmarks, the method attains strong results on instruction adherence (iRecall), perceptual realism (FID), and object-level recall, while maintaining fast inference with the non-auto
1. The comparisons largely stop at pre-2025 methods and omit recent state-of-the-art systems (e.g., ReSpace [1]).
2. The paper lacks qualitative or quantitative diagnostics (e.g., relation violations, object collisions, discretization artifacts).
3. All results are on synthetic data; there is no real-world study (assets/layouts) or human evaluation, so external validity and deployment readiness remain unclear.
4. The manuscript does not examine multi-room layouts, longer relational prom
- **Novelty**. The method introduces a tailored non-autoregressive Transformer (NAT) for 3D scene synthesis, using dual-granularity masked modeling (attribute- and instance-level) to capture intra- and inter-object dependencies. The decoupled triplet predictor improves spatial relation modeling, overcoming limitations of traditional text representations.
- **Comprehensive Experiments**. The experiments are thorough, including quantitative metrics (iRecall, FID/CLIP-FID/KID) and qualitative comp
1. **Limited Generalization to Complex Relational Scenarios**. This work limits the maximum number of relational constraints per instruction to 4, citing the token length limit of the CLIP text encoder. However, it does not address how the model would scale to more complex and realistic design scenarios. Additionally, the paper only validates 11 predefined spatial relations (e.g., "right of", "above") from InstructScene, missing common fine-grained or ambiguous relations, and still faces the iss
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis
