MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes
Ruijie Lu, Yixin Chen, Junfeng Ni, Baoxiong Jia, Yu Liu, Diwen Wan, Gang Zeng, Siyuan Huang

TL;DR
MOVIS improves multi-object novel view synthesis (NVS) in indoor scenes through structure-aware inputs, an auxiliary object mask prediction task, and a structure-guided training strategy, yielding more consistent and accurate multi-object rendering.
Contribution
The paper introduces MOVIS, a framework that combines structure-aware features (depth and object masks), an auxiliary novel-view mask prediction task, and a structure-guided timestep sampling scheduler to improve multi-object NVS.
Findings
Enhanced cross-view consistency in multi-object scenes.
Improved object placement accuracy under novel views.
Strong generalization demonstrated on synthetic and real datasets.
Abstract
Repurposing pre-trained diffusion models has been proven to be effective for NVS. However, these methods are mostly limited to a single object; directly applying such methods to compositional multi-object scenarios yields inferior results, especially incorrect object placement and inconsistent shape and appearance under novel views. How to enhance and systematically evaluate the cross-view consistency of such models remains under-explored. To address this issue, we propose MOVIS to enhance the structural awareness of the view-conditioned diffusion model for multi-object NVS in terms of model inputs, auxiliary tasks, and training strategy. First, we inject structure-aware features, including depth and object mask, into the denoising U-Net to enhance the model's comprehension of object instances and their spatial relationships. Second, we introduce an auxiliary task requiring the model to…
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- The incorporation of structure-aware features, such as depth and object masks, enhances the model's ability to understand complex spatial relationships in multi-object scenarios. - The introduction of an auxiliary task for predicting novel view object masks demonstrates a thoughtful approach to improving model performance.
- The paper proposes a new scheduler, the structure-guided timestep sampling scheduler. It is motivated by [1] and improves on it by replacing the fixed variance used when sampling the timestep from a Gaussian distribution with a variance that decays linearly from 1000 to 500. I have two concerns with this new strategy: 1) While this idea sounds more reasonable than [1], I did not find an ablation study showing that it is better than [1]. Table 2 seems to have some numbers
1. Targeting multi-object NVS is interesting, as most current NVS methods are validated on single-object scenarios. By addressing challenges like correct object placement, shape, and appearance across views, this direction opens up new possibilities for compositional scene generation. 2. Using corresponding points as a validation metric is a valuable addition, as it provides a more robust measure of cross-view consistency. Unlike image-level metrics alone, corresponding points offer a way to d
1. The paper lacks commonly adopted NVS evaluation metrics that directly assess multi-view consistency, such as running 3D reconstruction (e.g., NeRF) to compute mesh differences (e.g., Chamfer distance) or re-rendering metrics like SSIM or LPIPS. These approaches evaluate the generated multi-view consistency more directly by quantifying structural and perceptual alignment across views. Without these metrics, it’s harder to objectively compare the quality of synthesized outputs. 2.
- MOVIS integrates structure-aware features and an auxiliary task for novel view mask prediction, improving the model’s comprehension of spatial relationships and accurate object placement in multi-object scenarios. - MOVIS demonstrates generalization across unseen datasets and employs a structure-guided timestep sampling scheduler that balances global object placement with fine-grained detail recovery, enhancing overall synthesis quality.
- This paper has significant issues in presenting results. Although the paper claims to achieve consistent multi-view synthesis for multiple objects, the results do not support this claim. - For example, in Figure 4, the third input (see row 3, column 1 and row 3, column 3) shows an orange pillow and a yellow pillow, which are distinctly different in color. However, the output shows two yellow pillows. Additionally, the sofa in the generated result includes unfounded noise (on th
The paper has the following strengths. - Enhanced spatial understanding: incorporating depth and object mask data enhances spatial awareness, which is critical in multi-object scenes. - Improved cross-view consistency: the model demonstrates strong consistency across views, validated by a high Hit Rate and lower matching distance on datasets like Objaverse. - Generalization to unseen data: MOVIS shows strong adaptability across datasets, including synthetic and real-world sce
However, there are a few problems. First, incorporating additional structure-aware inputs could require independent models for different modalities, such as a depth estimation network or a segmentation network. In the paper, it seems the ground truth has been … Also, the experimental datasets are mostly synthetic; it might also be worth checking performance on real object datasets or randomly generated images, such as MVImgNet. For the Scheduler Strategy, it seems the strategy i
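The timestep sampling idea discussed in the reviews above can be illustrated with a small sketch. It follows one reviewer's wording only: a timestep is drawn from a Gaussian whose variance decays linearly from 1000 to 500 over training. The function names, the mean of 500, the 1000-step schedule length, and the clamping are all assumptions made for illustration; this is not the authors' implementation, and the paper itself should be consulted for which parameter is actually scheduled.

```python
import random

# Assumption: a 1000-step diffusion schedule, as is common for DDPM-style models.
NUM_TRAIN_TIMESTEPS = 1000


def linear_decay(step, total_steps, start=1000.0, end=500.0):
    """Linearly interpolate from `start` (first step) to `end` (last step)."""
    frac = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    return start + frac * (end - start)


def sample_timestep(mean, variance, rng=random):
    """Draw one integer timestep from N(mean, variance), clamped to the valid range."""
    t = rng.gauss(mean, variance ** 0.5)
    return min(max(int(round(t)), 0), NUM_TRAIN_TIMESTEPS - 1)


def sample_for_step(step, total_steps, mean=500.0, rng=random):
    """Hypothetical helper: decay the variance with training progress, then sample."""
    variance = linear_decay(step, total_steps)
    return sample_timestep(mean, variance, rng)
```

In a training loop, `sample_for_step(step, total_steps)` would stand in for the usual uniform draw of a diffusion timestep. A fixed-variance baseline of the kind the reviewer attributes to [1] corresponds to calling `sample_timestep` with a constant `variance`.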
Taxonomy
Topics: Advanced Vision and Imaging · Robotics and Sensor-Based Localization · Advanced Image and Video Retrieval Techniques
Methods: Diffusion · Max Pooling · Convolution · Concatenated Skip Connection · U-Net
