SceneNAT: Masked Generative Modeling for Language-Guided Indoor Scene Synthesis

Jeongjun Choi; Yeonsoo Park; H. Jin Kim

arXiv:2601.07218 · cs.CV · January 13, 2026

Open Access · 3 Reviews

TL;DR

SceneNAT introduces a masked non-autoregressive Transformer that efficiently generates complete 3D indoor scenes from natural language, outperforming previous methods in accuracy and computational efficiency.

Contribution

The paper proposes SceneNAT, a novel masked modeling approach with a triplet predictor for improved scene synthesis from language instructions.

Findings

01. Outperforms state-of-the-art methods in semantic and spatial accuracy
02. Operates with significantly lower computational cost
03. Effectively captures intra-object and inter-object relationships

Abstract

We present SceneNAT, a single-stage masked non-autoregressive Transformer that synthesizes complete 3D indoor scenes from natural language instructions through only a few parallel decoding passes, offering improved performance and efficiency compared to prior state-of-the-art approaches. SceneNAT is trained via masked modeling over fully discretized representations of both semantic and spatial attributes. By applying a masking strategy at both the attribute level and the instance level, the model can better capture intra-object and inter-object structure. To boost relational reasoning, SceneNAT employs a dedicated triplet predictor for modeling the scene's layout and object relationships by mapping a set of learnable relation queries to a sparse set of symbolic triplets (subject, predicate, object). Extensive experiments on the 3D-FRONT dataset demonstrate that SceneNAT achieves…
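The two masking granularities described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the token layout, the `MASK` id, the function name `mask_scene`, and the two ratios are all assumptions chosen for illustration. It only shows the idea of hiding either individual attributes of an object or every attribute of an object at once, so that a model would have to recover them from the remaining context.

```python
import random

MASK = -1  # hypothetical mask token id (assumption, not taken from the paper)


def mask_scene(scene, attr_ratio, inst_ratio, rng):
    """Return a masked copy of `scene` and the set of masked positions.

    scene:      list of objects, each a list of discrete attribute tokens.
    inst_ratio: probability of masking every attribute of an object
                (instance-level masking).
    attr_ratio: probability of masking a single attribute of an object
                that was not masked as a whole (attribute-level masking).
    """
    masked = [list(obj) for obj in scene]  # copy so the input stays untouched
    targets = set()
    for i, obj in enumerate(scene):
        whole_object = rng.random() < inst_ratio
        for j in range(len(obj)):
            if whole_object or rng.random() < attr_ratio:
                masked[i][j] = MASK
                targets.add((i, j))
    return masked, targets


# Toy scene; the attribute layout per object is an assumption:
# [category, x_bin, y_bin, z_bin, size_bin, yaw_bin]
scene = [
    [3, 12, 0, 40, 7, 2],
    [8, 30, 0, 41, 5, 0],
    [1, 20, 5, 10, 2, 1],
]
masked, targets = mask_scene(scene, attr_ratio=0.3, inst_ratio=0.3,
                             rng=random.Random(0))
```

In a training loop built on this sketch, the positions in `targets` would be the ones a loss is computed on. Attribute-level masks force predictions from the other attributes of the same object, while instance-level masks force predictions from the other objects in the scene.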

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

+ The paper's core insight, that a Non-Autoregressive Transformer is a powerful alternative to AR or Diffusion models for this task, is a significant strength. This approach opens a promising third direction for high-speed, high-quality structured 3D generation.
+ The concept of the Triplet Predictor is also a key strength. Decoupling symbolic relation-understanding from the geometric generation task is an intelligent design choice.
+ The paper shows impressive performance and efficiency via compre

Weaknesses

- A key concern is the reliance on templated data. The training instructions are synthetically generated from structured relations. This implies the model is primarily learning to "invert" this synthetic generation process rather than parsing free-form human language directly. It remains unclear how the model would generalize to ambiguous, "in-the-wild" human instructions that do not follow the rigid structure. Therefore, the claim of handling "complex instructions" may be somewhat overstated.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The paper introduces a masked non-autoregressive Transformer for language-guided 3D scene synthesis, augmented with an explicit triplet-based relational module. This design is well-motivated and yields favorable decoding efficiency relative to conventional autoregressive or diffusion approaches.
2. On synthetic benchmarks, the method attains strong results on instruction adherence (iRecall), perceptual realism (FID), and object-level recall, while maintaining fast inference with the non-auto

Weaknesses

1. The comparisons largely stop at pre-2025 methods and omit recent state-of-the-art systems (e.g., ReSpace [1]).
2. The paper lacks qualitative or quantitative diagnostics (e.g., relation violations, object collisions, discretization artifacts).
3. All results are on synthetic data; there is no real-world study (assets/layouts) or human evaluation, so external validity and deployment readiness remain unclear.
4. The manuscript does not examine multi-room layouts, longer relational prom

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- **Novelty**. The method introduces a tailored non-autoregressive Transformer (NAT) for 3D scene synthesis, using dual-granularity masked modeling (attribute- and instance-level) to capture intra- and inter-object dependencies. The decoupled triplet predictor improves spatial relation modeling, overcoming limitations of traditional text representations.
- **Comprehensive Experiments**. The experiments are thorough, including quantitative metrics (iRecall, FID/CLIP-FID/KID) and qualitative comp

Weaknesses

1. **Limited Generalization to Complex Relational Scenarios**. This work limits the maximum number of relational constraints per instruction to 4, citing the token length limit of the CLIP text encoder. However, it does not address how the model would scale to more complex and realistic design scenarios. Additionally, the paper only validates 11 predefined spatial relations (e.g., "right of", "above") from InstructScene, missing common fine-grained or ambiguous relations, and still faces the iss

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis