CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image

Kaixin Yao; Longwen Zhang; Xinhao Yan; Yan Zeng; Qixuan Zhang; Wei Yang; Lan Xu; Jiayuan Gu; Jingyi Yu

arXiv:2502.12894 · cs.CV · May 14, 2025

Open Access

TL;DR

CAST is a novel method that reconstructs high-quality 3D scenes from a single RGB image by integrating object segmentation, spatial relationship analysis, occlusion-aware generation, and physics-based correction for realistic and coherent scene modeling.

Contribution

The paper introduces CAST, a comprehensive framework combining GPT-based analysis, large-scale 3D generation, and physics-aware optimization for improved 3D scene reconstruction from a single image.

Findings

01 Effective occlusion handling with Signed Distance Fields

02 Accurate object alignment and scene coherence

03 Enhanced physical realism in reconstructed scenes
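
The findings above mention signed distance fields and physical realism, but this listing does not say how the two are combined. As background only, the sketch below shows what a signed distance field is (negative inside a shape, positive outside) and how it could be used to measure how far sampled points of one object sink into another. The sphere shape and the names `sphere_sdf` and `penetration_penalty` are illustrative assumptions, not the paper's formulation.

```python
import math
from typing import Iterable, Tuple

Point = Tuple[float, float, float]


def sphere_sdf(p: Point, center: Point, radius: float) -> float:
    """Signed distance from p to a sphere: negative inside, zero on the surface."""
    return math.dist(p, center) - radius


def penetration_penalty(points: Iterable[Point], center: Point, radius: float) -> float:
    """Sum of penetration depths for the points that lie inside the sphere.

    Hypothetical penalty for illustration: points outside contribute nothing.
    """
    return sum(max(0.0, -sphere_sdf(p, center, radius)) for p in points)


# Three sample points against a unit sphere at the origin:
# the first two are inside, the third is outside.
samples = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (2.0, 0.0, 0.0)]
penalty = penetration_penalty(samples, center=(0.0, 0.0, 0.0), radius=1.0)
```

A penalty of zero means no sampled point lies inside the shape, which is the kind of condition a physics-aware correction step might drive toward.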

Abstract

Recovering high-quality 3D scenes from a single RGB image is a challenging task in computer graphics. Current methods often struggle with domain-specific limitations or low-quality object generation. To address these limitations, we propose CAST (Component-Aligned 3D Scene Reconstruction from a Single RGB Image), a novel method for 3D scene reconstruction and recovery. CAST starts by extracting object-level 2D segmentation and relative depth information from the input image, then uses a GPT-based model to analyze inter-object spatial relationships. This analysis captures how objects relate to each other within the scene, ensuring a more coherent reconstruction. CAST then employs an occlusion-aware large-scale 3D generation model to independently generate each object's full geometry, using masked autoencoder (MAE) and point cloud conditioning to mitigate the effects of occlusions and partial object…
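
The abstract pairs per-object segmentation masks with a depth map and later refers to point cloud conditioning. One common way to connect the two, shown here as a minimal sketch rather than the authors' implementation, is to back-project the masked depth pixels through a pinhole camera model. The intrinsics `fx`, `fy`, `cx`, `cy` and the function name `backproject_masked_depth` are assumptions for illustration; with relative depth the resulting points are only defined up to an unknown scale.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def backproject_masked_depth(
    depth: List[List[float]],
    mask: List[List[bool]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Lift the pixels selected by `mask` into 3D camera-space points.

    Assumes a pinhole camera; pixels with non-positive depth are skipped.
    """
    points: List[Point] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            if not keep or z <= 0.0:
                continue
            x = (u - cx) * z / fx  # horizontal offset scaled by depth
            y = (v - cy) * z / fy  # vertical offset scaled by depth
            points.append((x, y, z))
    return points


# Toy 2x2 depth map and object mask (hypothetical values).
depth = [[2.0, 2.0], [4.0, 0.0]]
mask = [[True, False], [True, True]]
cloud = backproject_masked_depth(depth, mask, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

Only two of the four pixels survive here: one is outside the mask and one has no valid depth, which mirrors how an occluded object yields a partial point cloud.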

Taxonomy

Topics: 3D Surveying and Cultural Heritage · Industrial Vision Systems and Defect Detection · Medical Image Segmentation Techniques

Methods: Masked autoencoder · ALIGN